Finally, the most important step is to look at the data you’ve identified as
input and figure out how it can be transformed into the output data in an efficient
way. How does the input data need to be arranged? Normally you’ll want a large block of the same data type, but if two data types need to be processed at the same time, interleaving them might make more sense. Or maybe the transformation needs two separate passes over the same data type, where the second pass uses fields that went untouched in the first pass. In that case, the type might be a good candidate for splitting into two, with each kept sequentially in memory.
Once you have decided on the transformation, the only thing left is gathering the inputs from the rest of the system and filling the outputs. When you’re transitioning from a more traditional architecture, you might have to perform an explicit gathering step: query some functions or objects and collect the input data in the format you want. You’ll have to perform a similar operation with the output, feeding it into the rest of the system. Even though those extra steps represent a performance hit, the benefits gained usually offset any performance costs. As more systems start using the data-oriented approach, you’ll be able to feed the output data from one system directly into the input of another, and you’ll really be able to reap the benefits.
15.5 Real-World Situations
Homogeneous, Sequential Data
You are probably already applying some of the principles of data-oriented design in your games right now: the particle system. It’s intended to handle thousands and thousands of particles. The input data is carefully designed to be small, aligned, and to fit in cache lines, and the output data is also very well defined because it probably feeds directly into the GPU.
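To make that layout concrete, here is a minimal sketch of how such a particle system might arrange its data. Everything in it (the struct, the field names, the update) is invented for illustration; the point is that each field lives in its own contiguous block, so the update pass streams through exactly the data it touches and nothing else.

    #include <cstddef>
    #include <vector>

    // Structure-of-arrays layout: every field is a contiguous block.
    struct ParticleSystem
    {
        std::vector<float> posX, posY, posZ;
        std::vector<float> velX, velY, velZ;
        std::vector<float> life; // remaining lifetime, in seconds

        void Update(float dt)
        {
            const std::size_t count = posX.size();
            for (std::size_t i = 0; i < count; ++i)
            {
                posX[i] += velX[i] * dt;
                posY[i] += velY[i] * dt;
                posZ[i] += velZ[i] * dt;
                life[i] -= dt;
            }
        }
    };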
Unfortunately for us, we don’t have that many situations in game development where we can apply the same principle. It may be possible for some sound or image processing, but most other tasks seem to require much more varied data and lots of different code paths.
Heterogeneous Data
Game entities are the perfect example of why the straightforward particle approach doesn’t work in other game subsystems. You probably have dozens of different game entity types. Or maybe you have a single game entity type but dozens, or even hundreds, of components that, when grouped together, give entities their own behavior.
One simple step we can take when dealing with large groups of heterogeneous data like that is to group similar data types together. For example, we would lay out all the health components for all entities in the game one right after the other in the same memory block. Same thing with armor components, and every other type of component.
If you just rearranged them and still updated them one game entity at a time, you wouldn’t see any performance improvements. To gain a significant performance boost, you need to change the update from being entity-centric to being component-centric. You need to update all health components first, then all armor components, and so on through every component type. At that point, your memory access patterns will have improved significantly, and you should see much better performance.
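As a sketch, with invented component types, the rearrangement plus the component-centric update might look like this; each loop is one linear walk through a single homogeneous block.

    #include <vector>

    // Hypothetical component types, each stored in its own contiguous block.
    struct HealthComponent { float current, max, regenPerSecond; };
    struct ArmorComponent  { float durability, decayPerSecond; };

    std::vector<HealthComponent> g_healthComponents;
    std::vector<ArmorComponent>  g_armorComponents;

    void UpdateAllComponents(float dt)
    {
        // Component-centric: one sequential pass per component type,
        // instead of hopping between types one entity at a time.
        for (HealthComponent& h : g_healthComponents)
        {
            h.current += h.regenPerSecond * dt;
            if (h.current > h.max)
                h.current = h.max;
        }

        for (ArmorComponent& a : g_armorComponents)
            a.durability -= a.decayPerSecond * dt;
    }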
Break and Batch
It turns out that sometimes even updating a single game entity component seems
to need a lot of input data, and sometimes it’s even unpredictable what it’s going
to need. That’s a sign that we need to break the update into multiple steps, each
of them with smaller, more predictable input data.
For example, the navigation component casts several rays into the world. Since the ray casts happen as part of the update, all of the data they touch is considered input data. In this case, it means that potentially the full world and other entities’ collision data are part of the input data! Instead of collecting that data ahead of time and feeding it as an input into each component update, we can break the component update into two parts. The initial update figures out what ray casts are required and generates ray-cast queries as part of its output. The second update takes the results from the ray casts requested by the first update and finishes updating the component state.
The crucial step, once again, is the order of the updates. What we want to do is perform the first update on all navigation components and gather all the ray-cast queries. Then we take all those queries, cast all those rays, and save the results. Finally, we do another pass over the navigation components, calling the second update for each of them and feeding them the results of the ray queries.
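Putting the two passes and the batched ray casts together, a sketch might look like the following. The types and the CastRay() and FinishNavUpdate() functions are placeholders for whatever the engine actually provides.

    #include <cstddef>
    #include <vector>

    struct Vec3      { float x, y, z; };
    struct RayQuery  { Vec3 origin, direction; };
    struct RayResult { bool hit; Vec3 point; };
    struct NavComponent { Vec3 position, desiredDir; };

    RayResult CastRay(const RayQuery& query);                    // engine-provided
    void FinishNavUpdate(NavComponent& nav, const RayResult& r); // second update

    void UpdateNavigation(std::vector<NavComponent>& navs)
    {
        // Pass 1: every component emits the ray query it needs.
        std::vector<RayQuery> queries;
        queries.reserve(navs.size());
        for (const NavComponent& nav : navs)
            queries.push_back({nav.position, nav.desiredDir});

        // Batch: resolve all the rays in one step, so the collision data
        // is traversed once instead of once per component update.
        std::vector<RayResult> results;
        results.reserve(queries.size());
        for (const RayQuery& q : queries)
            results.push_back(CastRay(q));

        // Pass 2: another linear pass over the components, feeding each
        // one the result of the ray it requested.
        for (std::size_t i = 0; i < navs.size(); ++i)
            FinishNavUpdate(navs[i], results[i]);
    }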
Notice how, once again, we managed to take code with a tree-like memory access structure and turn it into something more linear that works over similar sets of data. The ray-casting step isn’t an ideal linear traversal, but at least it’s restricted to a single step, and the world collision data might fit in some cache level, so we won’t take too many misses all the way to main memory.
Once you have this implemented, if the ray-casting part is still too slow, you
could analyze the data and try to speed things up. For example, if you often have
lots of grouped ray casts, it might be beneficial to sort the ray casts spatially first, so that as they’re being resolved, they’re more likely to hit data that is already cached.
Conditional Execution
Another common situation is that not all data of the same type needs to be updated the same way. For example, the navigation component doesn’t always need to cast the same number of rays. Maybe it normally casts a few rays every half a second, but casts more rays when other entities are close by.
In that case, we can let the component decide whether it needs a second-pass
update by whether it creates a ray-cast query. Now we’re not going to have a
fixed number of ray queries per entity, so we’ll also need a way to make sure we
associate the ray cast with the entity it came from.
After all ray casts are performed, we iterate over the navigation components
and only update the ones that requested a ray query. That might save us a bit of
CPU time, but chances are that it won’t improve performance very much because
we’re going to be randomly skipping components and missing out on the benefits
of accessing memory linearly.
If the amount of data needed by the second update is fairly small, we could copy it into the output of the first update. That way, whenever we’re ready to perform the second update, we only need to access the data generated this way, which is laid out sequentially in memory.
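A sketch of that idea, with invented types: the first pass copies the few fields the second pass needs into the query record itself, so the second pass walks two dense, sequential arrays and only touches a component again for the final write-back.

    #include <cstddef>
    #include <vector>

    struct Vec3      { float x, y, z; };
    struct RayResult { bool hit; Vec3 point; };
    struct NavComponent { Vec3 position, desiredDir, target; bool needsRay; };

    // Output of the first pass: one record per component that actually
    // requested a ray, carrying the copied state the second pass needs.
    struct PendingRay
    {
        unsigned sourceIndex; // which component to write back to
        Vec3     position;    // small state copied from the component
        Vec3     direction;
    };

    void SecondPass(std::vector<NavComponent>& navs,
                    const std::vector<PendingRay>& pending,
                    const std::vector<RayResult>& results)
    {
        // pending[] and results[] are parallel arrays, walked front to
        // back; components that skipped this frame are never touched.
        for (std::size_t i = 0; i < pending.size(); ++i)
        {
            const PendingRay& p = pending[i];
            if (!results[i].hit) // path is clear; steer toward it
                navs[p.sourceIndex].target = { p.position.x + p.direction.x,
                                               p.position.y + p.direction.y,
                                               p.position.z + p.direction.z };
        }
    }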
If copying the data isn’t practical (there’s either too much data, or that data needs to be written back to the component itself), we could exploit temporal coherence, if there is any. If components either cast rays or don’t, and do so for several frames at a time, we could reorder the components in memory so that all navigation components that cast rays are found at the beginning of the memory block. Then the second update can proceed linearly through the block until the last component that requested a ray cast is updated. To be able to achieve this, we need to make sure that our data is easily relocatable.
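If the set of ray-casting components is stable across frames, the reordering itself can be as simple as a partition. A sketch, assuming a hypothetical needsRay flag and components that are safe to move (referenced by index or handle, never by raw pointer):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct NavComponent { bool needsRay; /* ...rest of the component... */ };

    // Moves every component that casts rays to the front of the block and
    // returns how many there are; the second update then runs linearly
    // over navs[0..count). Note that this relocates components, so
    // anything referring to them must use handles or be re-patched.
    std::size_t GroupRayCasters(std::vector<NavComponent>& navs)
    {
        auto firstIdle = std::partition(navs.begin(), navs.end(),
            [](const NavComponent& n) { return n.needsRay; });
        return static_cast<std::size_t>(firstIdle - navs.begin());
    }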
Polymorphism
Whenever we’re applying data-oriented design, we explicitly traverse sets of data of a known type. Unlike in an object-oriented approach, we would never traverse a set of heterogeneous data, calling polymorphic functions on each element.
Even so, while we’re transforming some well-known data, we might need to treat another set of data polymorphically. For example, even though we’re updating the bullet data (a well-known type), we might want to deliver damage to any entity it hits, independent of the type of that entity. Since using classes and inheritance is not usually a very data-friendly approach, we need to find a better alternative.
There are many different ways to go about this, depending on the kind of
game architecture you have. One possibility is to split the common functionality
of a game entity into a separate data type. This would probably be a very small
set of data: a handle, a type, and possibly some flags or indices to components. If
every entity in the world has one corresponding set of data of this type, we can
always count on it while dealing with other entities. In this case, the bullet data
update could check whether the entity has a damage-handling component, and if
so, access it and deliver the damage.
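A sketch of what that common record might contain; all of the field names and the index scheme are invented for illustration.

    #include <cstdint>

    using EntityHandle = std::uint32_t;

    constexpr std::uint16_t kNoComponent = 0xFFFF; // entity lacks this component

    // The small, always-present record every entity has, stored in one
    // contiguous block and indexed by the entity's slot.
    struct EntityCommon
    {
        EntityHandle  handle;          // stable identifier for the entity
        std::uint16_t type;            // game-specific entity type id
        std::uint16_t flags;
        std::uint16_t damageComponent; // index into the damage-component
                                       // block, or kNoComponent
    };

    // The bullet update can then deliver damage without a virtual call:
    //
    //     const EntityCommon& e = entities[hitEntityIndex];
    //     if (e.damageComponent != kNoComponent)
    //         damageComponents[e.damageComponent].pendingDamage += bullet.damage;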
If that last step left you a bit uncomfortable, congratulations, you’re starting to really get a feel for good data access patterns. If you analyze it, the access patterns are less than ideal: we’re updating all the current bullets in the world. That’s fine, because they’re all laid out sequentially in memory. Then, when one of them hits an entity, we need to access that entity’s data, and then potentially the damage-handling component. That’s two potentially random accesses into memory that are almost guaranteed to be cache misses.
We could improve on this a little bit by having the bullet update not access the entity directly and, instead, create a message packet with the damage it wants to deliver to that entity. After we’re done updating all of the bullets, we can make another pass over those messages and apply the damage. That might result in a marginal improvement (it’s doubtful that accessing the entity and its component is going to cause any cache misses on the following bullet data), but most importantly, it prevents us from having any meaningful interaction with the entity. Is the entity bulletproof? Maybe that kind of bullet doesn’t even hit the entity, and the bullet should go through unnoticed? In that case, we really want to access the entity data during the bullet update.
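For completeness, here is a sketch of the message-packet variant just described, again with invented names; whether its marginal gain is worth losing the immediate interaction with the entity is exactly the trade-off discussed above.

    #include <vector>

    // One small record per hit, appended during the bullet pass.
    struct DamageMessage
    {
        unsigned targetEntity; // index of the entity that was hit
        float    amount;
    };

    std::vector<DamageMessage> g_damageQueue;

    // During the bullet update, instead of touching the entity:
    //     g_damageQueue.push_back({hitEntityIndex, bullet.damage});

    // After all bullets are updated, apply the queued damage in one pass.
    void ApplyQueuedDamage(std::vector<float>& entityHealth)
    {
        for (const DamageMessage& msg : g_damageQueue)
            entityHealth[msg.targetEntity] -= msg.amount;
        g_damageQueue.clear();
    }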
In the end, it’s important to realize that not every data access is going to be ideal. As with all optimizations, the most important ones are the ones that happen most frequently. A bullet might travel for hundreds of frames, but it will hit something in at most one frame. It’s not going to make much of a difference if, during the frame when it hits, we have a few extra memory accesses.
15.6 Parallelization
Improving memory access patterns is only half of the performance benefit provided by data-oriented design. The other half is being able to take advantage of multiple cores very easily.
Normally, to split up tasks across different cores, we need to create a description of the job to perform and some of the inputs to the job. Unfortunately, where things fall down for procedural or object-oriented approaches is that a lot of tasks have implicit inputs: world data, collision data, and so on. Developers try to work around this by providing exclusive access to data through locking systems, but that’s very error-prone and can be a big hit on performance, depending on how frequently it happens.
The good news is that once you’ve architected your code such that you’re thinking about the data first and following the guidelines in earlier sections, you’re ready to parallelize it with very little effort. All of your inputs are clearly defined, and so are your outputs. You also know which tasks need to be performed before other tasks, based on which data they consume and produce.
The only part missing is a scheduler. Once you have all of that information
about your data and the transformations that need to happen to it, the scheduler
can hand off tasks to individual cores based on what work is available. Each core
gets the input data, the address where the output data should go, and what kind of
transformation to apply to it.
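A sketch of what such a job description might contain; the exact shape is entirely up to the engine, and these fields are only illustrative.

    #include <cstddef>

    // Everything a core needs to run one transformation: explicit input,
    // explicit output, and the function to apply.
    struct JobDesc
    {
        const void* input;      // start of the input data block
        std::size_t inputSize;  // how much the job reads
        void*       output;     // where the results go (often a fresh buffer)
        void (*transform)(const void* input, std::size_t inputSize, void* output);
    };

    // Run on whichever core the scheduler picked. Because the reads and
    // writes are spelled out in the JobDesc, the scheduler can also keep
    // jobs that touch the same memory from running at the same time.
    void RunJob(const JobDesc& job)
    {
        job.transform(job.input, job.inputSize, job.output);
    }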
Because all inputs are clearly defined, and outputs are usually new memory buffers, there is often no need to provide exclusive access to any data. Whenever the output data writes back into an area of memory that was used as an input (for example, entity states), the scheduler can make sure there are no jobs trying to read from that memory while the job that writes the output is executing. And because each job is very “shallow” (in the sense that it doesn’t perform cascading function calls), the data each one touches is very limited, making the scheduling relatively easy.
If the entire game has been architected this way, the scheduler can then create a complete directed acyclic graph of the data dependencies between jobs for each frame. It can use that information, along with the timings from each previous frame (assuming some temporal coherency), to estimate what the critical path of data transformation is going to be, giving priority to those jobs whenever possible.
One of the consequences of running a large system of data transformations this way is that there are often a lot of intermediate memory buffers. For platforms with little memory available, the scheduler can trade some speed for extra memory by giving priority to jobs that consume intermediate data over ones on the critical path.
This approach works well for any kind of parallel architecture, even one where the individual cores don’t have access to main memory. Since all of the input data is explicitly listed, it can easily be transferred to each core’s local memory before the transformation happens.