Chapter 10: Advanced Rendering Techniques and Optimizations

In this chapter, we will touch on a few more advanced topics, as well as show you how to integrate multiple effects and techniques into a single graphical application.

This chapter will cover the following recipes:

  • Doing frustum culling on the CPU
  • Doing frustum culling on the GPU with compute shaders
  • Implementing order-independent transparency
  • Loading texture assets asynchronously
  • Implementing projective shadows for directional lights
  • Using GPU atomic counters in Vulkan
  • Putting it all together into an OpenGL demo

Technical requirements

To run the recipes in this chapter, you need to use a computer with a video card that supports OpenGL 4.6 with ARB_bindless_texture and Vulkan 1.2. Read Chapter 1, Establishing a Build Environment, if you want to learn how to build the demo applications shown in this book.

This chapter relies on the geometry loading code that was explained in Chapter 7, Graphics Rendering Pipeline, and the lightweight OpenGL rendering queues we described in Chapter 9, Working with Scene Graphs. Make sure you read both these chapters before proceeding any further.

You can find the code files for this chapter on GitHub at https://github.com/PacktPublishing/3D-Graphics-Rendering-Cookbook/tree/master/Chapter10.

Doing frustum culling on the CPU

Frustum culling is used to determine whether a part of our scene is visible from a viewing frustum. There are many tutorials on the internet that show how to do this. However, most of them have a significant drawback.

As Inigo Quilez pointed out in his blog (http://www.iquilezles.org/www/articles/frustumcorrect/frustumcorrect.htm), many frustum culling examples check whether a mesh is outside the viewing frustum by testing its axis-aligned bounding box (AABB) against the six frustum planes. Any box that lies entirely on the outer side of at least one of these planes is rejected. This approach produces false positives when a big enough AABB that is not visible intersects some of the frustum planes without being entirely outside any single one of them. The naive approach accepts such AABBs as visible, which reduces the culling efficiency. If we are culling individual meshes, taking care of such cases may not be worth it performance-wise. However, if we cull big boxes that are containers of some sort, such as the nodes of an octree, false positives may become a significant problem. The solution is to add a reversed culling test in which the eight frustum corner points are tested against the six planes of the AABB.

In this recipe, we will show you some self-contained frustum culling code in C++ and teach you how to use it to render the Bistro scene.

Getting ready

Make sure you have checked the Implementing lightweight rendering queues in OpenGL recipe in Chapter 9, Working with Scene Graphs, so that you know how the new GLMesh class operates.

The example source code for this demo can be found in Chapter10/GL01_CullingCPU.

How to do it…

To implement both box-in-frustum and frustum-in-box tests, we must extract six frustum planes and eight frustum corner points from our view-projection matrix. Here's the code to do so:

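  1. By default, GLM uses a right-handed coordinate system. The six frustum planes are extracted from the transposed, pre-multiplied view-projection matrix, vp. The transpose is needed because GLM matrices are column-major, so vp[i] returns a column, while the planes are built from the rows of the matrix: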

    void getFrustumPlanes(glm::mat4 vp, glm::vec4* planes) {

      using glm::vec4;

      vp = glm::transpose(vp);

      planes[0] = vec4(vp[3] + vp[0]); // left

      planes[1] = vec4(vp[3] - vp[0]); // right

      planes[2] = vec4(vp[3] + vp[1]); // bottom

      planes[3] = vec4(vp[3] - vp[1]); // top

      planes[4] = vec4(vp[3] + vp[2]); // near

      planes[5] = vec4(vp[3] - vp[2]); // far

    }

  2. The eight frustum corner points can be produced by taking the corners of the [-1, 1] cube in normalized device coordinates, transforming them with the inverse of the pre-multiplied view-projection matrix, and performing perspective division:

    void getFrustumCorners(glm::mat4 vp, glm::vec4* points)
    {
      using glm::vec4;

      const vec4 corners[] = {
        vec4(-1, -1, -1, 1), vec4( 1, -1, -1, 1),
        vec4( 1,  1, -1, 1), vec4(-1,  1, -1, 1),
        vec4(-1, -1,  1, 1), vec4( 1, -1,  1, 1),
        vec4( 1,  1,  1, 1), vec4(-1,  1,  1, 1)
      };

      const glm::mat4 invVP = glm::inverse(vp);

      for (int i = 0; i != 8; i++) {
        const vec4 q = invVP * corners[i];
        points[i] = q / q.w;
      }
    }

  3. Now, the culling code is straightforward. Let's check if a bounding box is fully outside any of the six frustum planes:

    bool isBoxInFrustum(glm::vec4* frPlanes,
                        glm::vec4* frCorners,
                        const BoundingBox& b)
    {
      using glm::dot;
      using glm::vec4;

      for (int i = 0; i < 6; i++) {
        int r = 0;
        r += (dot(frPlanes[i], vec4(b.min_.x, b.min_.y, b.min_.z, 1.f)) < 0) ? 1 : 0;
        r += (dot(frPlanes[i], vec4(b.max_.x, b.min_.y, b.min_.z, 1.f)) < 0) ? 1 : 0;
        r += (dot(frPlanes[i], vec4(b.min_.x, b.max_.y, b.min_.z, 1.f)) < 0) ? 1 : 0;
        r += (dot(frPlanes[i], vec4(b.max_.x, b.max_.y, b.min_.z, 1.f)) < 0) ? 1 : 0;
        r += (dot(frPlanes[i], vec4(b.min_.x, b.min_.y, b.max_.z, 1.f)) < 0) ? 1 : 0;
        r += (dot(frPlanes[i], vec4(b.max_.x, b.min_.y, b.max_.z, 1.f)) < 0) ? 1 : 0;
        r += (dot(frPlanes[i], vec4(b.min_.x, b.max_.y, b.max_.z, 1.f)) < 0) ? 1 : 0;
        r += (dot(frPlanes[i], vec4(b.max_.x, b.max_.y, b.max_.z, 1.f)) < 0) ? 1 : 0;
        if (r == 8) return false;
      }

    Note

    Whether something is "inside" or "outside" of a plane is determined by the direction of its normal vector.

  4. Then, we can check if the same frustum is inside the box:

      int r = 0;
      for (int i = 0; i < 8; i++)
        r += (frCorners[i].x > b.max_.x) ? 1 : 0;
      if (r == 8) return false;

      r = 0;
      for (int i = 0; i < 8; i++)
        r += (frCorners[i].x < b.min_.x) ? 1 : 0;
      if (r == 8) return false;

      r = 0;
      for (int i = 0; i < 8; i++)
        r += (frCorners[i].y > b.max_.y) ? 1 : 0;
      if (r == 8) return false;

      r = 0;
      for (int i = 0; i < 8; i++)
        r += (frCorners[i].y < b.min_.y) ? 1 : 0;
      if (r == 8) return false;

      r = 0;
      for (int i = 0; i < 8; i++)
        r += (frCorners[i].z > b.max_.z) ? 1 : 0;
      if (r == 8) return false;

      r = 0;
      for (int i = 0; i < 8; i++)
        r += (frCorners[i].z < b.min_.z) ? 1 : 0;
      if (r == 8) return false;

      return true;
    }
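
To see why the second phase matters, the following self-contained sketch reproduces the two-phase test without GLM and runs it on a box that the plane-only test wrongly accepts. It is an illustration under stated assumptions, not the code from the demo: the Vec4, Box, and Frustum types and the makeFrustum(), boxPassesPlanes(), frustumPassesBox(), and isBoxVisible() helpers are hypothetical names, the camera sits at the origin looking down the negative Z axis (so the view matrix is the identity), and the corners are computed analytically rather than through the inverse view-projection matrix.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <initializer_list>

// Hypothetical stand-ins for glm::vec4 and the book's BoundingBox.
struct Vec4 { float x, y, z, w; };
struct Box  { float min[3]; float max[3]; };

static float dot4(const Vec4& a, const Vec4& b) {
  return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}
static Vec4 add(const Vec4& a, const Vec4& b) {
  return {a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w};
}
static Vec4 sub(const Vec4& a, const Vec4& b) {
  return {a.x - b.x, a.y - b.y, a.z - b.z, a.w - b.w};
}

struct Frustum {
  std::array<Vec4, 6> planes;   // dot(plane, point) >= 0 means "inside"
  std::array<Vec4, 8> corners;  // world-space corner points
};

// Assumption: camera at the origin looking down -Z (identity view matrix),
// OpenGL clip-space conventions, fovY given in radians.
Frustum makeFrustum(float fovY, float aspect, float zNear, float zFar) {
  assert(zNear > 0.0f && zFar > zNear);
  const float t = std::tan(fovY * 0.5f);
  const float f = 1.0f / t;
  // Rows of the standard OpenGL perspective projection matrix.
  const Vec4 r0{f / aspect, 0.0f, 0.0f, 0.0f};
  const Vec4 r1{0.0f, f, 0.0f, 0.0f};
  const Vec4 r2{0.0f, 0.0f, (zFar + zNear) / (zNear - zFar),
                2.0f * zFar * zNear / (zNear - zFar)};
  const Vec4 r3{0.0f, 0.0f, -1.0f, 0.0f};

  Frustum fr;
  // Same row combinations as getFrustumPlanes():
  // left, right, bottom, top, near, far.
  fr.planes = {add(r3, r0), sub(r3, r0), add(r3, r1),
               sub(r3, r1), add(r3, r2), sub(r3, r2)};
  int k = 0;
  for (float d : {zNear, zFar})
    for (float sy : {-1.0f, 1.0f})
      for (float sx : {-1.0f, 1.0f})
        fr.corners[k++] = {sx * d * t * aspect, sy * d * t, -d, 1.0f};
  return fr;
}

// Phase 1 only: the naive box-versus-planes test.
bool boxPassesPlanes(const Frustum& fr, const Box& b) {
  for (const Vec4& pl : fr.planes) {
    int outside = 0;
    for (int i = 0; i < 8; i++) {
      const Vec4 p{(i & 1) ? b.max[0] : b.min[0],
                   (i & 2) ? b.max[1] : b.min[1],
                   (i & 4) ? b.max[2] : b.min[2], 1.0f};
      if (dot4(pl, p) < 0.0f) outside++;
    }
    if (outside == 8) return false;  // fully behind one plane
  }
  return true;
}

// Phase 2: the reversed test, frustum corners against the six box faces.
bool frustumPassesBox(const Frustum& fr, const Box& b) {
  for (int axis = 0; axis < 3; axis++) {
    int above = 0, below = 0;
    for (const Vec4& c : fr.corners) {
      const float v = (axis == 0) ? c.x : (axis == 1) ? c.y : c.z;
      if (v > b.max[axis]) above++;
      if (v < b.min[axis]) below++;
    }
    if (above == 8 || below == 8) return false;
  }
  return true;
}

bool isBoxVisible(const Frustum& fr, const Box& b) {
  return boxPassesPlanes(fr, b) && frustumPassesBox(fr, b);
}
```

With a 90-degree vertical field of view, an aspect ratio of 1, a near plane at 1, and a far plane at 10, a box spanning (12, -1, -30) to (40, 1, -5) is not fully behind any single frustum plane, so the first phase accepts it. The frustum itself never extends past roughly x = 10, so all eight corners fall below the box's minimum x and the reversed test rejects it.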

Let's take a look at how this culling code can be used in our demos. The main driving code is in the Chapter10/GL01_CullingCPU/src/main.cpp file:

  1. Our per-frame uniform buffer is unified across all the demos in this chapter because we are using a single set of shaders to render most of the Bistro scene. That's why we have light and cameraPos here. The GLSL code will check if light[3][3] is zero and skip shadow rendering if so:

    struct PerFrameData {

      mat4 view;

      mat4 proj;

      mat4 light = mat4(0.0f); // unused in this demo

      vec4 cameraPos;

    };

  2. We have a standard camera positioner with a camera, which is used to initialize our view matrix for culling. In this demo, we will allow users to "freeze" a culling frustum and fly around to observe which objects are being culled. All other global variables are used to control the rendering from ImGui:

    CameraPositioner_FirstPerson positioner(  vec3(-10.0f, 3.0f, 3.0f), vec3(0.0f, 0.0f, -1.0f),  vec3(0.0f, 1.0f, 0.0f));

    Camera camera(positioner);

    mat4 g_CullingView = camera.getViewMatrix();

    bool g_FreezeCullingView = false;

    bool g_DrawMeshes = true;

    bool g_DrawBoxes = true;

    bool g_DrawGrid = true;

  3. The main() function starts by loading a set of shaders. We will reuse the grid rendering code from the Implementing an infinite grid GLSL shader recipe of Chapter 5, Working with Geometry Data. The GL01_scene_IBL.vert and GL01_scene_IBL.frag scene rendering shaders are universal for all the demos in this chapter. They are rather similar to the ones we used in the previous chapter, except for the shadow mapping part, which will be discussed later in this chapter, in the Implementing projective shadows for directional lights recipe:

    int main(void) {

      GLApp app;

      GLShader shdGridVert(    "data/shaders/chapter05/GL01_grid.vert");

      GLShader shdGridFrag(    "data/shaders/chapter05/GL01_grid.frag");

      GLProgram progGrid(shdGridVert, shdGridFrag);

      GLShader shaderVert(    "data/shaders/chapter10/GL01_scene_IBL.vert");

      GLShader shaderFrag(    "data/shaders/chapter10/GL01_scene_IBL.frag");

      GLProgram program(shaderVert, shaderFrag);

  4. Create a buffer for our per-frame uniforms and configure some global OpenGL states:

      const GLsizeiptr kUniformBufferSize =    sizeof(PerFrameData);

      GLBuffer perFrameDataBuffer(    kUniformBufferSize, nullptr,    GL_DYNAMIC_STORAGE_BIT);

      glBindBufferRange(GL_UNIFORM_BUFFER,    kBufferIndex_PerFrameUniforms,    perFrameDataBuffer.getHandle(), 0,    kUniformBufferSize);

      glClearColor(1.0f, 1.0f, 1.0f, 1.0f);

      glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);

    Instead of keeping two separate GLSceneData instances for the external and internal meshes of the Bistro scene, we have merged everything into one big mesh. This is done in the SceneConverter tool, which we covered in Chapter 7, Graphics Rendering Pipeline. Furthermore, the original Bistro scene contains a tree with thousands of leaves in it, represented as separate meshes. Our merging code compacts separate meshes for leaves into one single mesh. The code is quite specific for the Bistro scene because it uses explicit names for the materials that the meshes should be merged with. Check the mergeBistro() function in Chapter7/SceneConverter/src/main.cpp for more details. The GLMesh class here comes from Chapter9/GLMesh9.h and was discussed in the previous chapter. Check the Implementing lightweight rendering queues in OpenGL recipe for more details:

      GLSceneData sceneData(    "data/meshes/bistro_all.meshes",    "data/meshes/bistro_all.scene",    "data/meshes/bistro_all.materials");

      GLMesh mesh(sceneData);

  5. We will skip the code that sets up glfwSetCursorPosCallback(), glfwSetMouseButtonCallback(), and glfwSetKeyCallback() here and get straight to the rendering-related code. The GLSkyboxRenderer class is declared in Chapter9/GLSkyboxRenderer.h and is used to render the skybox and store light probes for radiance and irradiance, as well as the BRDF LUT texture. The ImGuiGLRenderer class was described in the Rendering a basic UI with Dear ImGui recipe of Chapter 2, Using Essential Libraries. The minimalist implementation we are providing here can be found in the shared/UtilsGLImGui.h file:

      GLSkyboxRenderer skybox;

      ImGuiGLRenderer rendererUI;

      CanvasGL canvas;

    Now, we have reached the part that's related to frustum culling. We must store bounding boxes for each shape object in our scene data. Considering that the entire scene geometry is static, we can pre-transform all the boxes into world space using the corresponding model-to-world matrices.

  6. One big bounding box, called fullScene, which encloses the entire scene, will be used to render debug information:

      for (const auto& c : sceneData.shapes_)  {

        mat4 model = sceneData.      scene_.globalTransform_[c.transformIndex];

        sceneData.meshData_.      boxes_[c.meshIndex].transform(model);

      }

      BoundingBox fullScene =    combineBoxes(sceneData.meshData_.boxes_);

  7. The main loop goes through the traditional camera position update, sets up the OpenGL viewport, and clears the framebuffer. Once we have all these values, we can fill in the PerFrameData structure and upload it to the OpenGL buffer:

      while (!glfwWindowShouldClose(app.getWindow()))  {

         positioner.update(       app.getDeltaSeconds(), mouseState.pos,       mouseState.pressedLeft);

         int width, height;

         glfwGetFramebufferSize(       app.getWindow(), &width, &height);

         const float ratio = width / (float)height;

         glViewport(0, 0, width, height);

         glClearNamedFramebufferfv(0, GL_COLOR, 0,       glm::value_ptr(vec4(0.0f, 0.0f, 0.0f, 1.0f)));

         glClearNamedFramebufferfi(       0, GL_DEPTH_STENCIL, 0, 1.0f, 0);

         const mat4 proj = glm::perspective(       45.0f, ratio, 0.1f, 1000.0f);

         const mat4 view = camera.getViewMatrix();

         const PerFrameData perFrameData = {       .view = view,       .proj = proj,       .light = mat4(0.0f),       .cameraPos =         glm::vec4(camera.getPosition(), 1.0f) };

         glNamedBufferSubData(       perFrameDataBuffer.getHandle(), 0,       kUniformBufferSize, &perFrameData);

  8. The culling frustum is separate from our camera's viewing frustum, which is very handy for debugging and demonstration purposes. The six frustum planes and eight corners are extracted from the pre-multiplied proj*g_CullingView matrix:

         if (!g_FreezeCullingView)

            g_CullingView = camera.getViewMatrix();

         vec4 frustumPlanes[6];

         getFrustumPlanes(       proj * g_CullingView, frustumPlanes);

         vec4 frustumCorners[8];

         getFrustumCorners(       proj * g_CullingView, frustumCorners);

  9. Here goes the culling loop. Let's iterate over all the drawing commands in the indirect buffer, get the appropriate bounding box for each one, and check its visibility. Instead of manipulating the order of the draw commands in the buffer, we will set the number of instances to 0 for any object that was culled. Once the new content of the indirect buffer has been uploaded to the GPU, we are ready to render the culled scene:

         int numVisibleMeshes = 0;

         {

           DrawElementsIndirectCommand* cmd =         mesh.bufferIndirect_.drawCommands_.data();

           for (const auto& c : sceneData.shapes_) {

             cmd->instanceCount_ = isBoxInFrustum(           frustumPlanes, frustumCorners,           sceneData.meshData_.           boxes_[c.meshIndex]) ? 1 : 0;

             numVisibleMeshes += (cmd++)->instanceCount_;

           }

           mesh.bufferIndirect_.uploadIndirectBuffer();

         }

  10. Let's do some debug drawing before we render the actual scene. We will render a red bounding box for every invisible mesh and a green box for visible ones. The fullScene box shows the bounds of the entire scene:

         if (g_DrawBoxes) {

            DrawElementsIndirectCommand* cmd =          mesh.bufferIndirect_.drawCommands_.data();

            for (const auto& c : sceneData.shapes_)

               drawBox3dGL(canvas, mat4(1.0f),             sceneData.meshData_.boxes_[c.meshIndex],             (cmd++)->instanceCount_ ?             vec4(0,1,0,1) : vec4(1,0,0,1));

            drawBox3dGL(canvas,          mat4(1.0f), fullScene, vec4(1,0,0,1));

         }

  11. The actual scene rendering goes like so. Here, we have a skybox, the Bistro scene, and an optional grid:

         skybox.draw();

         glDisable(GL_BLEND);

         glEnable(GL_DEPTH_TEST);

         if (g_DrawMeshes) {

           program.useProgram();

           mesh.draw(sceneData.shapes_.size());

         }

         if (g_DrawGrid) {

            glEnable(GL_BLEND);

            progGrid.useProgram();

            glDrawArraysInstancedBaseInstance(          GL_TRIANGLES, 0, 6, 1, 0);

         }

  12. When the culling frustum is frozen, we render it with yellow lines. This helps us debug the application and observe how culling behaves in different situations:

         if (g_FreezeCullingView)

            renderCameraFrustumGL(canvas,          g_CullingView, proj, vec4(1, 1, 0, 1), 100);

         canvas.flush();

  13. To conclude the main loop and this entire demo, the remaining ImGui code controls the application parameters, as well as displays the number of visible meshes:

         ImGuiIO& io = ImGui::GetIO();

         io.DisplaySize =       ImVec2((float)width, (float)height);

         ImGui::NewFrame();

         ImGui::Begin("Control", nullptr);

         ImGui::Text("Draw:");

         ImGui::Checkbox("Meshes", &g_DrawMeshes);

         ImGui::Checkbox("Boxes",  &g_DrawBoxes);

         ImGui::Checkbox("Grid",  &g_DrawGrid);

         ImGui::Separator();

         ImGui::Checkbox("Freeze culling frustum (P)",       &g_FreezeCullingView);

         ImGui::Separator();

         ImGui::Text("Visible meshes: %i",       numVisibleMeshes);

         ImGui::End();

         ImGui::Render();

         rendererUI.render(       width, height, ImGui::GetDrawData());

         app.swapBuffers();

      }

      return 0;

    }

Here is a screenshot from the running demo:

Figure 10.1 – Lumberyard Bistro scene with CPU frustum culling


The culling frustum, rendered with yellow lines, is frozen so that we can observe its orientation. The green boxes highlight visible meshes, while the red boxes are for invisible ones. Each time the P key is pressed, the culling frustum is realigned with the view of the camera.
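
The culling loop from the main loop above can be isolated from the rest of the demo. The sketch below is a simplified stand-in rather than the demo's code: the DrawElementsIndirectCommand structure is redeclared here with the member order OpenGL defines for indirect indexed draws, and cullDrawCommands() and its isVisible callback are hypothetical names that take the place of the inline isBoxInFrustum() call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Member order follows the OpenGL layout for indirect indexed draws;
// the trailing underscores mimic the naming used in the demo code.
struct DrawElementsIndirectCommand {
  uint32_t count_;
  uint32_t instanceCount_;
  uint32_t firstIndex_;
  int32_t  baseVertex_;
  uint32_t baseInstance_;
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "commands must stay tightly packed");

// Marks every command as drawn (1 instance) or culled (0 instances)
// without reordering or resizing the command array.
// Returns the number of visible commands.
std::size_t cullDrawCommands(
    std::vector<DrawElementsIndirectCommand>& cmds,
    const std::function<bool(std::size_t)>& isVisible) {
  assert(isVisible && "a visibility callback is required");
  std::size_t numVisible = 0;
  for (std::size_t i = 0; i < cmds.size(); i++) {
    cmds[i].instanceCount_ = isVisible(i) ? 1u : 0u;
    numVisible += cmds[i].instanceCount_;
  }
  return numVisible;
}
```

Because culled commands stay in the buffer with an instance count of zero, the number of commands submitted per frame never changes, which is why the demo can keep passing sceneData.shapes_.size() to mesh.draw().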

There's more...

You might be thinking that this kind of culling is inefficient because modern GPUs can render small meshes much faster than we can cull them on the CPU. This is mostly true. It does not make sense to cull small things such as bottles or separate leaves this way. However, the CPU culling pipeline still matters if we want to cull big clusters of objects. For example, if we create a 3D grid that covers the entire Bistro scene and has reasonably big cells, we can assign each mesh to the cells that it intersects. CPU culling can then be applied to each cell instead of to individual meshes, skipping the entire content of every culled cell. We leave this approach as a moderately complex exercise for you.
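
As a possible starting point for that exercise, here is a minimal sketch of the cell assignment step only. Everything in it is an assumption made for illustration: the Box type stands in for BoundingBox, and CullingGrid, buildGrid(), cellCoord(), and cellBox() are hypothetical names that do not exist in the book's code base. The frustum test itself is left out; each cell box returned by cellBox() would be passed to a function such as isBoxInFrustum().

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Box { float min[3]; float max[3]; };

// Hypothetical uniform grid over the scene bounds. Each cell stores the
// indices of the meshes whose world-space boxes overlap it.
struct CullingGrid {
  Box bounds;
  int dim[3];
  std::vector<std::vector<int>> cells;

  int cellIndex(int x, int y, int z) const {
    return x + dim[0] * (y + dim[1] * z);
  }
  // Maps a coordinate to a cell number along one axis, clamped to the grid.
  int cellCoord(int axis, float v) const {
    const float size = (bounds.max[axis] - bounds.min[axis]) / dim[axis];
    const int c = (int)std::floor((v - bounds.min[axis]) / size);
    return std::max(0, std::min(c, dim[axis] - 1));
  }
  // World-space box of one cell; this is what a frustum test would receive.
  Box cellBox(int x, int y, int z) const {
    const int c[3] = {x, y, z};
    Box b;
    for (int a = 0; a < 3; a++) {
      const float size = (bounds.max[a] - bounds.min[a]) / dim[a];
      b.min[a] = bounds.min[a] + size * c[a];
      b.max[a] = b.min[a] + size;
    }
    return b;
  }
};

// Registers every mesh box with all the cells it overlaps.
CullingGrid buildGrid(const Box& bounds, int nx, int ny, int nz,
                      const std::vector<Box>& meshBoxes) {
  assert(nx > 0 && ny > 0 && nz > 0);
  CullingGrid g;
  g.bounds = bounds;
  g.dim[0] = nx; g.dim[1] = ny; g.dim[2] = nz;
  g.cells.resize((std::size_t)nx * ny * nz);
  for (int i = 0; i < (int)meshBoxes.size(); i++) {
    const Box& b = meshBoxes[i];
    for (int z = g.cellCoord(2, b.min[2]); z <= g.cellCoord(2, b.max[2]); z++)
      for (int y = g.cellCoord(1, b.min[1]); y <= g.cellCoord(1, b.max[1]); y++)
        for (int x = g.cellCoord(0, b.min[0]); x <= g.cellCoord(0, b.max[0]); x++)
          g.cells[g.cellIndex(x, y, z)].push_back(i);
  }
  return g;
}
```

A mesh that straddles a cell boundary is registered in every cell it touches, so a culling pass built on top of this grid has to treat a mesh as visible if any of its cells is visible.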

Let's try to improve things a bit and move the entire frustum culling pipeline completely to the GPU via compute shaders.

Doing frustum culling on the GPU with compute shaders

In the previous recipe, we learned how to cull invisible meshes on the CPU in a classic and somewhat old-fashioned way. As GPUs grew more and more performant, this CPU-based approach became very inefficient. Let's learn how to utilize compute shaders to implement a frustum culling pipeline on a GPU using OpenGL.

The goal of this recipe is to teach you what you can do on modern GPUs with compute pipelines rather than implementing a very performant culling system. Once we are comfortable with the basics, we will discuss the limitations, as well as further directions for improvements.

Getting ready

Make sure you have read the previous recipe, Doing frustum culling on the CPU, so that you're familiar with the frustum culling basics.

The source code for this recipe can be found at Chapter10/GL02_CullingGPU.

How to do it…

Let's take a look at how we can port the C++ culling code from the previous recipe to the GLSL compute shaders:

  1. The main application starts by declaring per-frame uniform data. Besides the first four fields, which are necessary for our unified GLSL shaders, we will add a few more variables that pass frustum parameters to the GLSL code. Similar to the previous recipe, we need 6 frustum planes and 8 frustum corners. The total number of shapes to cull is passed to the shaders to allow proper bounds checking:

    struct PerFrameData {

      mat4 view;

      mat4 proj;

      mat4 light = mat4(0.0f); // unused in this demo

      vec4 cameraPos;

      vec4 frustumPlanes[6];

      vec4 frustumCorners[8];

      uint32_t numShapesToCull;

    };

  2. Let's look at some variables that can control the camera and culling parameters. This time, all the culling happens on the GPU side and we do not read back the culling results to render colored bounding boxes. Hence, we have fewer variables here:

    CameraPositioner_FirstPerson positioner(  vec3(-10.f, 3.f, 3.f), vec3(0.f, 0.f, -1.f),  vec3(0.f, 1.f, 0.f));

    Camera camera(positioner);

    mat4 g_CullingView = camera.getViewMatrix();

    bool g_FreezeCullingView = false;

    bool g_EnableGPUCulling = true;

  3. The main() function structure is similar to the one shown in the previous recipe. Besides the grid and scene shaders, we will load a compute shader called GL02_FrustumCulling.comp, which performs culling:

    int main(void) {
      GLApp app;

      GLShader shdGridVert(
        "data/shaders/chapter05/GL01_grid.vert");
      GLShader shdGridFrag(
        "data/shaders/chapter05/GL01_grid.frag");
      GLProgram progGrid(shdGridVert, shdGridFrag);

      GLShader shaderVert(
        "data/shaders/chapter10/GL01_scene_IBL.vert");
      GLShader shaderFrag(
        "data/shaders/chapter10/GL01_scene_IBL.frag");
      GLProgram program(shaderVert, shaderFrag);

      GLShader shaderCulling(
        "data/shaders/chapter10/GL02_FrustumCulling.comp");
      GLProgram programCulling(shaderCulling);

  4. To pass the bounding boxes to the culling shader, we need to allocate an OpenGL buffer for them. The culling compute shader uses three buffers. The first two hold the bounding boxes, which the shader reads, and the draw commands, which it reads and updates in place. The third one contains a single integer value and is used to read back the number of visible meshes to the CPU side for debugging. The binding points for these buffers go sequentially after kBufferIndex_PerFrameUniforms:

      const GLuint kMaxNumObjects = 128 * 1024;
      const GLsizeiptr kUniformBufferSize =
        sizeof(PerFrameData);
      const GLsizeiptr kBoundingBoxesBufferSize =
        sizeof(BoundingBox) * kMaxNumObjects;

      const GLuint kBufferIndex_BoundingBoxes =
        kBufferIndex_PerFrameUniforms + 1;
      const GLuint kBufferIndex_DrawCommands =
        kBufferIndex_PerFrameUniforms + 2;
      const GLuint kBufferIndex_NumVisibleMeshes =
        kBufferIndex_PerFrameUniforms + 3;

      GLBuffer perFrameDataBuffer(kUniformBufferSize,
        nullptr, GL_DYNAMIC_STORAGE_BIT);
      glBindBufferRange(GL_UNIFORM_BUFFER,
        kBufferIndex_PerFrameUniforms,
        perFrameDataBuffer.getHandle(), 0,
        kUniformBufferSize);

      GLBuffer boundingBoxesBuffer(
        kBoundingBoxesBufferSize, nullptr,
        GL_DYNAMIC_STORAGE_BIT);

  5. The previous recipe implemented all the culling on the CPU, so it was rather trivial to compute the number of visible meshes. The GPU culling is a bit trickier, though; we need to read a value from the GPU buffer back to the CPU to find out how many meshes are visible. Let's create a buffer with a single integer value and map its content to the CPU memory:

      GLBuffer numVisibleMeshesBuffer(sizeof(uint32_t),     nullptr, GL_MAP_READ_BIT | GL_MAP_WRITE_BIT |     GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);

      volatile uint32_t* numVisibleMeshesPtr =    (uint32_t*)glMapNamedBuffer(       numVisibleMeshesBuffer.getHandle(),       GL_READ_WRITE);

      assert(numVisibleMeshesPtr);

    This way, we can read the value from the CPU side once the GPU work has been properly synchronized. The integer value that the numVisibleMeshesPtr variable points to is marked as volatile to let the compiler know that it is going to be updated from outside the C++ program.

  6. Now, let's look at the code for initializing the OpenGL state and loading the Bistro model. The GLFW setup code, identical to what was shown in the previous recipe, will be skipped again here:

      glClearColor(1.0f, 1.0f, 1.0f, 1.0f);

      glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);

      glEnable(GL_DEPTH_TEST);

      GLSceneData sceneData(    "data/meshes/bistro_all.meshes",     "data/meshes/bistro_all.scene",     "data/meshes/bistro_all.materials");

      GLMesh mesh(sceneData);

      GLSkyboxRenderer skybox;

      ImGuiGLRenderer rendererUI;

      CanvasGL canvas;

      bool g_DrawMeshes = true;

      bool g_DrawGrid = true;

  7. Let's pretransform the bounding boxes to world space. The bounding boxes are stored non-sequentially inside the meshData_ container and require an indirection to access them. Let's reorder this container sequentially to simplify the work for our culling compute shader, which is going to access them later:

      std::vector<BoundingBox> reorderedBoxes;

      reorderedBoxes.reserve(sceneData.shapes_.size());

      for (const auto& c : sceneData.shapes_) {

         mat4 model = sceneData.scene_.       globalTransform_[c.transformIndex];

         reorderedBoxes.push_back(       sceneData.meshData_.boxes_[c.meshIndex]);

         reorderedBoxes.back().transform(model);

      }

      glNamedBufferSubData(    boundingBoxesBuffer.getHandle(), 0,    reorderedBoxes.size() * sizeof(BoundingBox),    reorderedBoxes.data());

  8. The main loop starts in exactly the same way as in the previous recipe: it updates the camera position, sets up the viewport, and clears the framebuffer. Once the frustum planes and corners have been retrieved, they are uploaded into the GPU buffer. The actual number of shapes to cull goes into the numShapesToCull field:

      while (!glfwWindowShouldClose(app.getWindow())) {

        positioner.update(      app.getDeltaSeconds(), mouseState.pos,      mouseState.pressedLeft);

        int width, height;

        glfwGetFramebufferSize(      app.getWindow(), &width, &height);

        const float ratio = width / (float)height;

        glViewport(0, 0, width, height);

        glClearNamedFramebufferfv(0, GL_COLOR, 0,      glm::value_ptr(vec4(0.0f, 0.0f, 0.0f, 1.0f)));

        glClearNamedFramebufferfi(      0, GL_DEPTH_STENCIL, 0, 1.0f, 0);

        if (!g_FreezeCullingView)

          g_CullingView = camera.getViewMatrix();

        const mat4 proj = glm::perspective(      45.0f, ratio, 0.1f, 1000.0f);

        const mat4 view = camera.getViewMatrix();

        PerFrameData perFrameData = {      .view = view,      .proj = proj,      .light = mat4(0.0f),      .cameraPos = glm::vec4(        camera.getPosition(), 1.0f),      .numShapesToCull = g_EnableGPUCulling ?        (uint32_t)sceneData.shapes_.size() : 0u     };

        getFrustumPlanes(proj * g_CullingView,      perFrameData.frustumPlanes);

        getFrustumCorners(proj * g_CullingView,      perFrameData.frustumCorners);

        glNamedBufferSubData(      perFrameDataBuffer.getHandle(), 0,      kUniformBufferSize, &perFrameData);

  9. Now, we can invoke the culling shader program. We reset the number of visible meshes, which is stored in a memory-mapped buffer, back to zero. To make sure this value is visible on the GPU side, we insert a GL_BUFFER_UPDATE_BARRIER_BIT memory barrier. We need to rebind the buffers with bounding boxes, draw commands, and the number of visible meshes before we call glDispatchCompute(), since the binding points were changed by the GLMesh rendering code in the previous frame. Our compute shader uses a local workgroup size of 64, hence the division:

        *numVisibleMeshesPtr = 0;

        programCulling.useProgram();

        glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);

        glBindBufferBase(GL_SHADER_STORAGE_BUFFER,      kBufferIndex_BoundingBoxes,      boundingBoxesBuffer.getHandle());

        glBindBufferBase(GL_SHADER_STORAGE_BUFFER,      kBufferIndex_DrawCommands,      mesh.bufferIndirect_.getHandle());

        glBindBufferBase(GL_SHADER_STORAGE_BUFFER,      kBufferIndex_NumVisibleMeshes,      numVisibleMeshesBuffer.getHandle());

        glDispatchCompute(      1 + (GLuint)sceneData.shapes_.size()/64, 1, 1);

    Once the compute shader has been dispatched, we must issue memory barriers to tell OpenGL that it needs to synchronize any further indirect rendering with our compute shader. The GL_COMMAND_BARRIER_BIT flag ensures that command data sourced from buffer objects by glDraw*Indirect() commands after the barrier will reflect the data written by shaders prior to the barrier. The buffer objects affected by this flag are derived from the GL_DRAW_INDIRECT_BUFFER binding we use in our rendering code. The GL_SHADER_STORAGE_BARRIER_BIT flag ensures that shader storage buffer accesses issued after the barrier will reflect the data written by shaders prior to the barrier. The second memory barrier and a fence are used to make sure our C++ code can read back the correct number of visible meshes from the memory-mapped buffer. The GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT flag makes sure that access from C++ to persistently mapped buffers will reflect any data that was written by the shaders prior to the barrier. The OpenGL specification mentions that this may cause additional synchronization operations, so it is only used here for debugging and demonstration purposes:

        glMemoryBarrier(      GL_COMMAND_BARRIER_BIT |      GL_SHADER_STORAGE_BARRIER_BIT);

        glMemoryBarrier(      GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT);

        GLsync fence = glFenceSync(      GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

  10. The scene rendering chunk of code follows the same path as in the previous recipe: the skybox, the Bistro mesh, and then the grid. The frozen culling frustum is rendered with yellow lines to help debug the scene:

        skybox.draw();

        glDisable(GL_BLEND);

        glEnable(GL_DEPTH_TEST);

        if (g_DrawMeshes) {

          program.useProgram();

          mesh.draw(sceneData.shapes_.size());

        }

        if (g_DrawGrid) {

           glEnable(GL_BLEND);

           progGrid.useProgram();

           glDrawArraysInstancedBaseInstance(         GL_TRIANGLES, 0, 6, 1, 0);

        }

        if (g_FreezeCullingView)

          renderCameraFrustumGL(canvas, g_CullingView,        proj, vec4(1, 1, 0, 1), 100);

         canvas.flush();

    Before we can display the number of visible meshes using ImGui, we need to ensure the GPU has finished its culling work and that the value has become available.

  11. Let's simply wait on the fence we created two steps earlier:

        for (;;) {

          GLenum res = glClientWaitSync(        fence, GL_SYNC_FLUSH_COMMANDS_BIT, 1000);

          if (res == GL_ALREADY_SIGNALED ||          res == GL_CONDITION_SATISFIED) break;

        }

        glDeleteSync(fence);

  12. The remaining ImGui code goes like this. After exiting the main loop, do not forget to clean up and unmap the persistent buffer:

        ImGuiIO& io = ImGui::GetIO();

        io.DisplaySize =       ImVec2((float)width, (float)height);

        ImGui::NewFrame();

        ImGui::Begin("Control", nullptr);

        ImGui::Text("Draw:");

        ImGui::Checkbox("Meshes", &g_DrawMeshes);

        ImGui::Checkbox("Grid",  &g_DrawGrid);

        ImGui::Separator();

        ImGui::Checkbox("Enable GPU culling",      &g_EnableGPUCulling);

        ImGui::Checkbox("Freeze culling frustum (P)",      &g_FreezeCullingView);

        ImGui::Separator();

        ImGui::Text("Visible meshes: %i",      *numVisibleMeshesPtr);

        ImGui::End();

        ImGui::Render();

        rendererUI.render(width, height, ImGui::GetDrawData());

         app.swapBuffers();

      }

      glUnmapNamedBuffer(    numVisibleMeshesBuffer.getHandle());

      return 0;

    }
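
Two host-side details in this listing are easy to get wrong: the C++ structures have to match the byte layout the shader expects, and the dispatch has to launch enough workgroups to cover every shape. The sketch below checks both at compile time and run time. It is an illustration under assumptions: Mat4, Vec4, and Vec3 are hypothetical plain-float stand-ins with the same size as the corresponding GLM types in their default configuration, BoundingBox is assumed to contain only the two vec3 members, and numWorkGroups() is a helper name introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical plain-float stand-ins for glm::mat4, glm::vec4, glm::vec3.
struct Mat4 { float m[16]; };
struct Vec4 { float v[4]; };
struct Vec3 { float v[3]; };

struct PerFrameData {
  Mat4 view;
  Mat4 proj;
  Mat4 light;
  Vec4 cameraPos;
  Vec4 frustumPlanes[6];
  Vec4 frustumCorners[8];
  uint32_t numShapesToCull;
};

// Assumed shape of the C++ bounding box: two vec3 values and nothing else.
struct BoundingBox { Vec3 min_; Vec3 max_; };

// Offsets that a std140 uniform block assigns to the same members:
// each mat4 takes 64 bytes and each vec4 array element takes 16 bytes.
static_assert(offsetof(PerFrameData, proj)            ==  64, "proj");
static_assert(offsetof(PerFrameData, light)           == 128, "light");
static_assert(offsetof(PerFrameData, cameraPos)       == 192, "cameraPos");
static_assert(offsetof(PerFrameData, frustumPlanes)   == 208, "frustumPlanes");
static_assert(offsetof(PerFrameData, frustumCorners)  == 304, "frustumCorners");
static_assert(offsetof(PerFrameData, numShapesToCull) == 432, "numShapesToCull");

// Six tightly packed floats per box, so an array of boxes has a 24-byte stride.
static_assert(sizeof(BoundingBox) == 6 * sizeof(float), "BoundingBox");

// Smallest number of workgroups that gives every item its own invocation.
inline uint32_t numWorkGroups(uint32_t numItems, uint32_t localSize) {
  assert(localSize > 0);
  return (numItems + localSize - 1) / localSize;
}
```

The demo computes the group count as 1 + size/64, which launches one spare workgroup whenever the number of shapes is an exact multiple of 64. That is harmless as long as the shader compares each invocation index against numShapesToCull, which is the reason this field is passed in the uniform buffer.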

That is all for the C++ part. Now, let's take a look at the GLSL compute shader, which is doing all the heavy lifting:

  1. The PerFrameData structure should be identical to its C++ counterpart:

    #version 460 core

    layout(local_size_x = 64, local_size_y = 1,  local_size_z = 1) in;

    layout(std140, binding = 0) uniform PerFrameData {

      mat4 view;

      mat4 proj;

      mat4 light;

      vec4 cameraPos;

      vec4 frustumPlanes[6];

      vec4 frustumCorners[8];

      uint numShapesToCull;

    };

  2. The buffer with bounding boxes is declared in the following way. The BoundingBox structure on the C++ side contains two vec3 values, min_ and max_, which are uploaded as six consecutive floats. In GLSL buffer layouts, a vec3 is aligned to 16 bytes, so a structure with two vec3 members would not match this packing. Instead, we replace the two vec3 values with 6 floats and use preprocessor macros to give them meaningful mnemonic names:

    struct AABB {

      float pt[6];

    };

    #define Box_min_x box.pt[0]

    #define Box_min_y box.pt[1]

    #define Box_min_z box.pt[2]

    #define Box_max_x box.pt[3]

    #define Box_max_y box.pt[4]

    #define Box_max_z box.pt[5]

    layout(std430, binding = 1) buffer BoundingBoxes {

      AABB in_AABBs[];

    };

  3. The buffer with draw commands is declared in the following way:

    struct DrawCommand {

      uint count;

      uint instanceCount;

      uint firstIndex;

      uint baseVertex;

      uint baseInstance;

    };

    layout(std430, binding = 2) buffer DrawCommands {

      DrawCommand in_DrawCommands[];

    };

    layout(std430, binding = 3) buffer NumVisibleMeshes {

      uint numVisibleMeshes;

    };

    The DrawCommand structure corresponds to its C++ counterpart. The buffer that specifies the number of visible meshes is optional and is only used for demonstration purposes.

  4. Here's the frustum culling function, which has mostly been copy-pasted from its C++ implementation. An identical two-phase approach was used in the previous Doing frustum culling on the CPU recipe:

    bool isAABBinFrustum(AABB box) {

      for (int i = 0; i < 6; i++) {

        int r = 0;

        r += ( dot( frustumPlanes[i],      vec4(Box_min_x, Box_min_y, Box_min_z, 1.0f) )      < 0.0 ) ? 1 : 0;

        r += ( dot( frustumPlanes[i],      vec4(Box_max_x, Box_min_y, Box_min_z, 1.0f) )      < 0.0 ) ? 1 : 0;

        r += ( dot( frustumPlanes[i],      vec4(Box_min_x, Box_max_y, Box_min_z, 1.0f) )      < 0.0 ) ? 1 : 0;

        r += ( dot( frustumPlanes[i],      vec4(Box_max_x, Box_max_y, Box_min_z, 1.0f) )      < 0.0 ) ? 1 : 0;

        r += ( dot( frustumPlanes[i],      vec4(Box_min_x, Box_min_y, Box_max_z, 1.0f) )      < 0.0 ) ? 1 : 0;

        r += ( dot( frustumPlanes[i],      vec4(Box_max_x, Box_min_y, Box_max_z, 1.0f) )      < 0.0 ) ? 1 : 0;

        r += ( dot( frustumPlanes[i],      vec4(Box_min_x, Box_max_y, Box_max_z, 1.0f) )      < 0.0 ) ? 1 : 0;

        r += ( dot( frustumPlanes[i],      vec4(Box_max_x, Box_max_y, Box_max_z, 1.0f) )      < 0.0 ) ? 1 : 0;

        if ( r == 8 ) return false;

      }

      int r = 0;

      r = 0; for ( int i = 0; i < 8; i++ )

       r += (frustumCorners[i].x > Box_max_x) ? 1 : 0;

      if ( r == 8 ) return false;

      r = 0; for ( int i = 0; i < 8; i++ )

        r += (frustumCorners[i].x < Box_min_x) ? 1 : 0;

      if ( r == 8 ) return false;

      r = 0; for ( int i = 0; i < 8; i++ )

        r += (frustumCorners[i].y > Box_max_y) ? 1 : 0;

      if ( r == 8 ) return false;

      r = 0; for ( int i = 0; i < 8; i++ )

        r += (frustumCorners[i].y < Box_min_y) ? 1 : 0;

      if ( r == 8 ) return false;

      r = 0; for ( int i = 0; i < 8; i++ )

       r += (frustumCorners[i].z > Box_max_z) ? 1 : 0;

      if ( r == 8 ) return false;

      r = 0; for ( int i = 0; i < 8; i++ )

       r += (frustumCorners[i].z < Box_min_z) ? 1 : 0;

      if ( r == 8 ) return false;

      return true;

    }

  5. Now, the main() function of the shader is straightforward. The gl_GlobalInvocationID value is used to identify which DrawCommand we are processing now. The upper 16 bits of the baseInstance field are used to retrieve the index of this mesh in the big collection of meshes, as we explained in the previous chapter in the Implementing lightweight rendering queues in OpenGL recipe. If frustum culling is enabled, we invoke isAABBinFrustum() and set the number of instances based on its return value, exactly as we did in C++. Don't forget to atomically increment the number of visible meshes after that:

    void main() {

      const uint idx = gl_GlobalInvocationID.x;

      // skip items beyond sceneData.shapes_.size()

      if (idx < numShapesToCull) {

        AABB box = in_AABBs[      in_DrawCommands[idx].baseInstance >> 16];

        uint numInstances = isAABBinFrustum(box) ? 1 : 0;

        in_DrawCommands[idx].instanceCount =      numInstances;

        atomicAdd(numVisibleMeshes, numInstances);

      }

      else {

        in_DrawCommands[idx].instanceCount = 1;

      }

    }
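
The index arithmetic in main() can be made explicit with a small C++ sketch. It assumes the packing suggested by the two expressions that appear in this chapter, baseInstance >> 16 for the mesh index and baseInstance & 0xffff for the material index. The helper names packBaseInstance(), meshIndexOf(), and materialIndexOf() are hypothetical and are not part of the book's framework:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: pack a mesh index into the upper 16 bits and a
// material index into the lower 16 bits of a 32-bit baseInstance value.
constexpr uint32_t packBaseInstance(uint32_t meshIndex, uint32_t materialIndex) {
  return (meshIndex << 16) | (materialIndex & 0xffffu);
}

// Mirrors `baseInstance >> 16` from the compute shader.
constexpr uint32_t meshIndexOf(uint32_t baseInstance) {
  return baseInstance >> 16;
}

// Mirrors `baseInstance & 0xffff`, which the C++ code uses for materials.
constexpr uint32_t materialIndexOf(uint32_t baseInstance) {
  return baseInstance & 0xffffu;
}

// Both fields survive a round trip through the packed value.
static_assert(meshIndexOf(packBaseInstance(3, 7)) == 3, "mesh index");
static_assert(materialIndexOf(packBaseInstance(3, 7)) == 7, "material index");
```

With this layout, each of the two fields can address at most 65,536 distinct items.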

The demo application produces the following image:

Figure 10.2 – Lumberyard Bistro scene with GPU frustum culling

The culling frustum is frozen and only visible geometry is rendered. Try flying around and toggling frustum freezing with the P key.

There's more...

It is important to mention that our culling pipeline is based on the assumption that setting the number of instances in DrawElementsIndirectCommand to 0 completely eliminates the penalty of rendering a mesh. In reality, this is not true, and you should consider compacting the indirect buffer by completely removing the culled items from it, or constructing a new indirect buffer directly from the compute shader using the atomic value of numVisibleMeshes as an index. This is especially important if culling is not done every frame and the buffer is reused across multiple frames.
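
The compaction idea can be illustrated on the CPU. The following is a minimal sketch, not the demo's code: std::atomic stands in for the GLSL atomic counter, the visibility flags are assumed to come from a culling test such as isAABBinFrustum(), and the compactDrawCommands() name is hypothetical:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same five fields as the DrawCommand structure in the shader.
struct DrawCommand {
  uint32_t count;
  uint32_t instanceCount;
  uint32_t firstIndex;
  uint32_t baseVertex;
  uint32_t baseInstance;
};

// Hypothetical CPU stand-in for a compacting compute pass: every visible
// command claims the next free slot in `out` through an atomic counter.
// Returns the number of commands written, that is, the draw count to submit.
inline uint32_t compactDrawCommands(const std::vector<DrawCommand>& in,
                                    const std::vector<bool>& visible,
                                    std::vector<DrawCommand>& out) {
  assert(in.size() == visible.size());
  out.resize(in.size());
  std::atomic<uint32_t> numVisible{0};
  for (size_t i = 0; i < in.size(); i++) {
    if (!visible[i]) continue;
    const uint32_t slot = numVisible.fetch_add(1); // like atomicAdd() in GLSL
    out[slot] = in[i];
    out[slot].instanceCount = 1;
  }
  out.resize(numVisible.load());
  return numVisible.load();
}
```

On the GPU, the order in which invocations claim slots is not deterministic, so the compacted buffer should not be assumed to preserve the original command order.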

There is a great presentation called GPU-Driven Rendering Pipelines, by Ulrich Haar and Sebastian Aaltonen, that dives deep into GPU culling and using compute functionality in rendering pipelines: http://advances.realtimerendering.com/s2015/aaltonenhaar_siggraph2015_combined_final_footer_220dpi.pdf.

Implementing order-independent transparency

So far, we've rendered transparent objects using a very simple punch-through transparency method, as described in the Implementing a material system recipe of Chapter 7, Graphics Rendering Pipeline. This approach was very limited in terms of quality, but it allowed us to render transparent objects together with opaque objects, which greatly simplified the entire rendering pipeline. Let's push this thing a bit further and implement another approach that allows correctly blended transparent objects.

Alpha blending multiple surfaces is an operation that requires all transparent surfaces to be sorted back to front. There are multiple ways to do this, such as sorting scene objects back to front, using multiple pass depth peeling techniques (https://matthewwellings.com/blog/depth-peeling-order-independent-transparency-in-vulkan), making order-independent approximations (http://casual-effects.blogspot.com/2014/03/weighted-blended-order-independent.html), and following a more recent work called Phenomenological Transparency by Morgan McGuire and Michael Mara (https://research.nvidia.com/publication/phenomenological-transparency).

Starting from OpenGL 4.2, it is possible to implement order-independent transparency using per-pixel linked lists via atomic counters and load-store-atomic read-modify-write operations on textures. This method is order-independent, which means it does not require any transparent geometry to be sorted prior to rendering. All necessary sorting happens in a fragment shader at the pixel level, after the actual scene has been rendered. The idea of the algorithm is to construct a linked list of fragments for each pixel of the screen, which helps with storing the color and depth values at each node of the list. Once the per-pixel lists have been constructed, we can sort them and blend them together using a full-screen fragment shader. Essentially, this is a two-pass algorithm. Our implementation is inspired by https://fr.slideshare.net/hgruen/oit-and-indirect-illumination-using-dx11-linked-lists. Let's take a look at how it works.

Getting ready

We recommend recapping on the Implementing a material system recipe of Chapter 7, Graphics Rendering Pipeline, to refresh yourself on the data structures of our material system. This example will use multiple rendering queues, all of which were described in the Implementing lightweight rendering queues in OpenGL recipe of the previous chapter.

The source code for this recipe can be found in Chapter10/GL03_OITransparency.

How to do it…

Our rendering pipeline for this example goes as follows. First, we will render opaque objects with standard shading, as we did in the previous recipe. Then, we will render transparent objects and add shaded fragments to linked lists instead of rendering them directly into the framebuffer. Finally, we will sort the linked lists and overlay the blended image on top of the opaque framebuffer.

Let's go through the C++ code for implementing this technique:

  1. The per-frame data being used here is the same as in the Doing frustum culling on the CPU recipe. We will skip the GLFW mouse and keyboard handling code here and focus only on the graphics part. The GLSL shader code is shared between multiple demos, hence the unused light field here:

    struct PerFrameData {

      mat4 view;

      mat4 proj;

      mat4 light = mat4(0.0f); // unused in this demo

      vec4 cameraPos;

    };

    bool g_DrawOpaque = true;

    bool g_DrawTransparent = true;

    bool g_DrawGrid = true;

    The main() function starts by loading the necessary shaders. The grid shader remains the same. The shaders for scene rendering are now split into two sets. The first set, GL01_scene_IBL.vert and GL01_scene_IBL.frag, is used for opaque objects. The second set, GL01_scene_IBL.vert and GL03_mesh_oit.frag, is used for transparent ones. A full-screen post-processing pass is handled in GL03_OIT.frag. We will look at the shaders right after the C++ code:

    int main(void) {

      GLApp app;

      GLShader shdGridVert(    "data/shaders/chapter05/GL01_grid.vert");

      GLShader shdGridFrag(    "data/shaders/chapter05/GL01_grid.frag");

      GLProgram progGrid(shdGridVert, shdGridFrag);

      GLShader shaderVert(    "data/shaders/chapter10/GL01_scene_IBL.vert");

      GLShader shaderFrag(    "data/shaders/chapter10/GL01_scene_IBL.frag");

      GLProgram program(shaderVert, shaderFrag);

      GLShader shaderFragOIT(    "data/shaders/chapter10/GL03_mesh_oit.frag");

      GLProgram programOIT(shaderVert, shaderFragOIT);

      GLShader shdFullScreenQuadVert(   "data/shaders/chapter08/GL02_FullScreenQuad.vert");

      GLShader shdCombineOIT(    "data/shaders/chapter10/GL03_OIT.frag");

      GLProgram progCombineOIT(    shdFullScreenQuadVert, shdCombineOIT);

    The per-frame uniforms, OpenGL state, and meshes should be set up the same way as in the previous demos:

      const GLsizeiptr kUniformBufferSize =    sizeof(PerFrameData);

      GLBuffer perFrameDataBuffer(kUniformBufferSize,    nullptr, GL_DYNAMIC_STORAGE_BIT);

      glBindBufferRange(GL_UNIFORM_BUFFER,    kBufferIndex_PerFrameUniforms,    perFrameDataBuffer.getHandle(), 0,    kUniformBufferSize);

      glClearColor(1.0f, 1.0f, 1.0f, 1.0f);

      glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);

      glEnable(GL_DEPTH_TEST);

      GLSceneData sceneData(    "data/meshes/bistro_all.meshes",    "data/meshes/bistro_all.scene",    "data/meshes/bistro_all.materials");

      GLMesh mesh(sceneData);

      GLSkyboxRenderer skybox;

      ImGuiGLRenderer rendererUI;

  2. Let's allocate two indirect buffers, as described in Implementing lightweight rendering queues in OpenGL – one for opaque meshes and the other for transparent ones. A simple lambda can be used to discriminate draw commands based on the sMaterialFlags_Transparent bit in the material flags that were set up by the SceneConverter tool from Chapter 7, Graphics Rendering Pipeline:

      GLIndirectBuffer meshesOpaque(    sceneData.shapes_.size());

      GLIndirectBuffer meshesTransparent(    sceneData.shapes_.size());

      auto isTransparent = [&sceneData](    const DrawElementsIndirectCommand& c) {

        const auto mtlIndex = c.baseInstance_ & 0xffff;

       const auto& mtl = sceneData.materials_[mtlIndex];

        return       (mtl.flags_ & sMaterialFlags_Transparent) > 0;

      };

      mesh.bufferIndirect_.selectTo(meshesOpaque,    [&isTransparent](const       DrawElementsIndirectCommand& c) -> bool {

            return !isTransparent(c);

          });

      mesh.bufferIndirect_.selectTo(meshesTransparent,    [&isTransparent](const       DrawElementsIndirectCommand& c) -> bool {

            return isTransparent(c);

          });

  3. Besides rendering queues, we require a bunch of buffers to store linked lists data. The C++ TransparentFragment structure is a replica of the related structure in our GLSL shaders, which stores a single node of a per-pixel linked list. We require a floating-point color value with an alpha channel to accommodate HDR reflections on transparent objects, a depth value, and an index of the next node:

      struct TransparentFragment {

        float R, G, B, A;

        float depth;

        uint32_t next;

      };

    Note

    In real-world applications, 32-bit floats for color channels can be replaced with 16-bit half-floats to save some memory when using the appropriate GLSL extensions. In C++, they can be mimicked using uint16_t. We avoid using vendor-specific OpenGL extensions for the sake of simplicity.

  4. The buffer allocation goes as follows. We allocate storage to allow an overdraw of 16M transparent fragments. Anything beyond that will be clipped by our fragment shader, which performs bounds checking:

      int width, height;

      glfwGetFramebufferSize(    app.getWindow(), &width, &height);

      GLFramebuffer framebuffer(    width, height, GL_RGBA8, GL_DEPTH_COMPONENT24);

      const uint32_t kMaxOITFragments = 16 * 1024 * 1024;

      const uint32_t kBufferIndex_TransparencyLists =    kBufferIndex_Materials + 1;

      GLBuffer oitAtomicCounter(sizeof(uint32_t),    nullptr, GL_DYNAMIC_STORAGE_BIT);

      GLBuffer oitTransparencyLists(    sizeof(TransparentFragment) * kMaxOITFragments,    nullptr, GL_DYNAMIC_STORAGE_BIT);

      GLTexture oitHeads(GL_TEXTURE_2D,    width, height, GL_R32UI);

      glBindImageTexture(0, oitHeads.getHandle(), 0,    GL_FALSE, 0, GL_READ_WRITE, GL_R32UI);

      glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0,    oitAtomicCounter.getHandle());

      glBindBufferBase(GL_SHADER_STORAGE_BUFFER,    kBufferIndex_TransparencyLists,    oitTransparencyLists.getHandle());

    The oitAtomicCounter buffer contains the total number of allocated fragments and is used as an atomic counter in a linear memory allocator, which is implemented in the GLSL shader. The oitTransparencyLists buffer is an actual pool of memory that's used to store linked lists. The oitHeads texture contains the integer values of the heads of per-pixel linked lists.

  5. Every frame, we should reset the atomic counter back to 0 and clear the heads buffer with the 0xFFFFFFFF value, which is a guard value to signal that there are no transparent fragments for this pixel of the viewport. Here is a lambda to do so:

      auto clearTransparencyBuffers =    [&oitAtomicCounter, &oitHeads]() {

          const uint32_t minusOne = 0xFFFFFFFF;

          const uint32_t zero = 0;

          glClearTexImage(oitHeads.getHandle(), 0,        GL_RED_INTEGER, GL_UNSIGNED_INT, &minusOne);

          glNamedBufferSubData(        oitAtomicCounter.getHandle(),        0, sizeof(uint32_t), &zero);

        };

  6. All the preparations are now complete, and we can enter the main loop. Clear the main framebuffer and upload the per-frame uniforms:

      while (!glfwWindowShouldClose(app.getWindow())) {

        positioner.update(      app.getDeltaSeconds(), mouseState.pos,      mouseState.pressedLeft);

        int width, height;

        glfwGetFramebufferSize(      app.getWindow(), &width, &height);

        const float ratio = width / (float)height;

        glViewport(0, 0, width, height);

        glClearNamedFramebufferfv(      framebuffer.getHandle(), GL_COLOR, 0,      glm::value_ptr(vec4(0.0f, 0.0f, 0.0f, 1.0f)));

        glClearNamedFramebufferfi(framebuffer.getHandle(),      GL_DEPTH_STENCIL, 0, 1.0f, 0);

        const mat4 proj = glm::perspective(      45.0f, ratio, 0.1f, 1000.0f);

        const mat4 view = camera.getViewMatrix();

        PerFrameData perFrameData = {      .view = view,      .proj = proj,      .light =  mat4(0.0f),      .cameraPos =        glm::vec4(camera.getPosition(), 1.0f)    };

        glNamedBufferSubData(      perFrameDataBuffer.getHandle(), 0,      kUniformBufferSize, &perFrameData);

  7. Scene rendering starts with clearing the transparency buffers and drawing the skybox. Then, we can render the opaque meshes of the Bistro scene using the corresponding rendering queue:

        clearTransparencyBuffers();

        framebuffer.bind();

        skybox.draw();

        glDisable(GL_BLEND);

        glEnable(GL_DEPTH_TEST);

        if (g_DrawOpaque) {

          program.useProgram();

          mesh.draw(meshesOpaque.drawCommands_.size(),        &meshesOpaque);

        }

  8. The grid is rendered after all the opaque objects have been rendered. This way, it will be correctly clipped to the opaque geometry and lie under transparent objects:

        if (g_DrawGrid) {

          glEnable(GL_BLEND);

          progGrid.useProgram();

          glDrawArraysInstancedBaseInstance(        GL_TRIANGLES, 0, 6, 1, 0);

          glDisable(GL_BLEND);

        }

    Transparent meshes are rendered differently, with the depth buffer writes turned off but the depth test enabled. This way, all transparent objects will be correctly clipped against opaque geometry, while not affecting each other. Rendering with programOIT should not impact the color buffer either, since all the transparent fragments go into per-pixel linked lists only. Once we have rendered the transparent objects, a flush and a GL_SHADER_STORAGE_BARRIER_BIT memory barrier are required, so that subsequent reads from linked lists stored in an SSBO can correctly access the data written by the programOIT shader program:

        if (g_DrawTransparent) {

          glDepthMask(GL_FALSE);

          programOIT.useProgram();

          mesh.draw(        meshesTransparent.drawCommands_.size(),        &meshesTransparent);

          glFlush();

          glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

          glDepthMask(GL_TRUE);

        }

        framebuffer.unbind();

  9. Now, we should run our full-screen pass to blend the fragments stored in the linked lists. The framebuffer is used as input to provide the color "below" the transparent fragments. Don't forget to disable the depth test as well as blending, which is going to be handled manually in the shader:

        glDisable(GL_DEPTH_TEST);

        glDisable(GL_BLEND);

        progCombineOIT.useProgram();

        glBindTextureUnit(0,      framebuffer.getTextureColor().getHandle());

        glDrawArrays(GL_TRIANGLES, 0, 6);

  10. The last part of the main loop is the ImGui interface, which controls the rendering process:

        ImGuiIO& io = ImGui::GetIO();

        io.DisplaySize =      ImVec2((float)width, (float)height);

        ImGui::NewFrame();

        ImGui::Begin("Control", nullptr);

        ImGui::Text("Draw:");

        ImGui::Checkbox("Opaque meshes", &g_DrawOpaque);

        ImGui::Checkbox("Transparent meshes",      &g_DrawTransparent);

        ImGui::Checkbox("Grid", &g_DrawGrid);

        ImGui::End();

        ImGui::Render();

        rendererUI.render(      width, height, ImGui::GetDrawData());

        app.swapBuffers();

      }

      return 0;

    }
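
The TransparentFragment layout from step 3 and the 16M-node pool from step 4 imply a fixed memory budget, which the following sketch works out at compile time. The oitPoolSizeBytes() and nodesThatFit() helpers are hypothetical. One caveat, stated here as an observation rather than something the text confirms: under the std430 layout rules, a structure that contains a vec4 has a 16-byte alignment, so the array stride of the GLSL TransparentFragment is expected to be 32 bytes rather than 24, and the sketch computes how many nodes fit in that case as well:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same layout as the C++ TransparentFragment structure from step 3.
struct TransparentFragment {
  float R, G, B, A;
  float depth;
  uint32_t next;
};

// Six tightly packed 4-byte members; assumes the usual 4-byte float.
static_assert(sizeof(TransparentFragment) == 24, "unexpected padding");
static_assert(offsetof(TransparentFragment, depth) == 16, "depth follows RGBA");
static_assert(offsetof(TransparentFragment, next) == 20, "next follows depth");

constexpr uint32_t kMaxOITFragments = 16 * 1024 * 1024;

// Hypothetical helper: size in bytes of the pool allocated on the C++ side.
constexpr uint64_t oitPoolSizeBytes(uint32_t maxFragments) {
  return uint64_t(sizeof(TransparentFragment)) * maxFragments;
}

// Hypothetical helper: number of nodes that fit for a given array stride.
constexpr uint64_t nodesThatFit(uint64_t poolBytes, uint64_t stride) {
  return poolBytes / stride;
}

// 24 bytes * 16M nodes = 384 MiB allocated by the C++ code.
static_assert(oitPoolSizeBytes(kMaxOITFragments) == 384ull * 1024 * 1024,
              "pool is 384 MiB");

// Assumption: 32-byte std430 stride (24 rounded up to a multiple of 16).
static_assert(nodesThatFit(oitPoolSizeBytes(kMaxOITFragments), 32) ==
              12ull * 1024 * 1024, "12M nodes fit at a 32-byte stride");
```

If the 32-byte stride applies on your driver, sizing the pool as 32 bytes per node, or padding the C++ structure to 32 bytes, would keep the shader's bounds check and the allocation in agreement.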

That is all for the C++ part. Now, let's switch to the GLSL code and look at both the new shaders that handle transparent objects.

The first fragment shader, GL03_mesh_oit.frag, takes care of populating the linked lists. Let's go through the entire source code to make things clearer:

  1. First, we will declare the required extensions and include our material description header from Chapter 7, Graphics Rendering Pipeline:

    #version 460 core

    #extension GL_ARB_bindless_texture : require

    #extension GL_ARB_gpu_shader_int64 : enable

    #include <data/shaders/chapter07/MaterialData.h>

    layout (early_fragment_tests) in;

    #include   <data/shaders/chapter10/GLBufferDeclarations.h>

    layout(std430, binding = 2) restrict readonly buffer Materials {

      MaterialData in_Materials[];

    };

    The early_fragment_tests layout specifier states that we want to prevent this fragment shader from being executed unnecessarily, should the fragment be discarded based on the depth test. This is necessary, as any redundant invocation of this shader will result in messed-up transparency lists.

  2. The fragment shader inputs correspond to the GL01_scene_IBL.vert vertex shader, which was described earlier:

    layout (location=0) in vec2 v_tc;

    layout (location=1) in vec3 v_worldNormal;

    layout (location=2) in vec3 v_worldPos;

    layout (location=3) in flat uint matIdx;

    layout (location=0) out vec4 out_FragColor;

    layout (binding = 5) uniform samplerCube texEnvMap;

    layout (binding = 6) uniform samplerCube   texEnvMapIrradiance;

    layout (binding = 7) uniform sampler2D texBRDF_LUT;

    #include <data/shaders/chapter07/AlphaTest.h>

    #include <data/shaders/chapter06/PBR.sp>

  3. The TransparentFragment structure here corresponds to the respective C++ structure. The bindings of auxiliary buffers should match those in the C++ code as well:

    struct TransparentFragment {

      vec4 color;

      float depth;

      uint next;

    };

    layout (binding = 0, r32ui) uniform uimage2D heads;

    layout (binding = 0, offset = 0) uniform atomic_uint   numFragments;

    layout (std430, binding = 3) buffer Lists {

      TransparentFragment Fragments[];

    };

  4. The main() function calculates perturbed normals and does simple image-based diffuse lighting using an irradiance map, similar to what we did in Chapter 7, Graphics Rendering Pipeline. The diffuse part of the physically-based shading comes from chapter06/PBR.sp:

    void main() {

      MaterialData mtl = in_Materials[matIdx];

      vec4 albedo = mtl.albedoColor_;

      vec3 normalSample = vec3(0.0, 0.0, 0.0);

      if (mtl.albedoMap_ > 0)

        albedo = texture(sampler2D(      unpackUint2x32(mtl.albedoMap_)), v_tc);

      if (mtl.normalMap_ > 0)

        normalSample = texture(sampler2D(      unpackUint2x32(mtl.normalMap_)), v_tc).xyz;

      vec3 n = normalize(v_worldNormal);

      if (length(normalSample) > 0.5)

        n = perturbNormal(n,          normalize(cameraPos.xyz - v_worldPos.xyz),          normalSample, v_tc);

      vec3 f0 = vec3(0.04);

      vec3 diffuseColor = albedo.rgb * (vec3(1.0) - f0);

      vec3 diffuse = texture(    texEnvMapIrradiance, n.xyz).rgb * diffuseColor;

  5. Let's add some ad hoc environment reflections for transparent objects and store the resulting color in out_FragColor. For transparent fragments, this is just a temporary variable, given that they are discarded at the end of the shader and never reach the color buffer:

      vec3 v = normalize(cameraPos.xyz - v_worldPos);

      vec3 reflection = reflect(v, n);

      vec3 colorRefl = texture(texEnvMap, reflection).rgb;

      out_FragColor = vec4(diffuse + colorRefl, 1.0);

  6. Once we have calculated the color value, we can insert this fragment into the corresponding linked list:

      float alpha = clamp(albedo.a, 0.0, 1.0) *                mtl.transparencyFactor_;

      bool isTransparent = alpha < 0.99;

      if (isTransparent && gl_HelperInvocation == false) {

        if (alpha > 0.01) {

          uint index =         atomicCounterIncrement(numFragments);

          const uint maxOITfragments = 16*1024*1024;

          if (index < maxOITfragments) {

            uint prevIndex = imageAtomicExchange(          heads, ivec2(gl_FragCoord.xy), index);

            Fragments[index].color =          vec4(out_FragColor.rgb, alpha);

            Fragments[index].depth = gl_FragCoord.z;

            Fragments[index].next  = prevIndex;

          }

       }

        discard;

      }

    };

    Besides the value of mtl.transparencyFactor_, transparency can be controlled by the values in the alpha channel of the albedo texture. The gl_HelperInvocation value tells us whether this fragment shader invocation is considered a helper invocation. As the OpenGL specification states, a helper invocation is a fragment shader invocation that is created solely for the purpose of evaluating derivatives for use in non-helper fragment shader invocations. Helper invocations should not disturb our linked lists data. The linear memory allocator is implemented using atomicCounterIncrement(), which returns the index of the next free node, while imageAtomicExchange() replaces the head of the corresponding linked list with this new index and returns the previous head.

    Note

    While being correct for transparent fragments, the preceding code has a serious limitation when it comes to opaque fragments where the isTransparent value is false. Opaque fragments do not contribute to linked lists and end up in the framebuffer. This is seemingly OK at first glance, but it is not correct. With the depth writes disabled, opaque fragments are not Z-sorted properly, resulting in an incorrectly rendered image when the opaque fragments of one transparent surface overlay another transparent surface containing opaque fragments. This can be avoided by adding everything to linked lists, regardless of the isTransparent value. Whether to accept the increased memory consumption in exchange for correct visuals is up to you. We are just showing the possibilities of making such optimizations.
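
The insertion logic of step 6 can be mimicked on the CPU to see how the counter, the heads, and the node pool interact. This is a minimal sketch under stated assumptions: std::atomic replaces the GLSL atomic counter and image atomics, pixels are addressed by a flat index, and the FragmentLists and Node names are hypothetical:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kEndOfList = 0xFFFFFFFFu; // guard value stored in empty heads

struct Node {
  float color[4];
  float depth;
  uint32_t next;
};

// Hypothetical CPU model of the per-pixel lists: one head per pixel,
// a fixed node pool, and a counter that acts as a linear allocator.
struct FragmentLists {
  std::vector<std::atomic<uint32_t>> heads;
  std::vector<Node> pool;
  std::atomic<uint32_t> numFragments{0};

  FragmentLists(size_t numPixels, size_t maxNodes)
  : heads(numPixels), pool(maxNodes) {
    for (auto& h : heads) h.store(kEndOfList);
  }

  // Allocate a node, swap it in as the new head of the pixel's list, and
  // link it to the previous head. Returns false once the pool is exhausted.
  bool insert(size_t pixel, float r, float g, float b, float a, float depth) {
    const uint32_t index = numFragments.fetch_add(1);   // atomicCounterIncrement()
    if (index >= pool.size()) return false;             // bounds check
    const uint32_t prev = heads[pixel].exchange(index); // imageAtomicExchange()
    pool[index] = Node{{r, g, b, a}, depth, prev};
    return true;
  }
};
```

Each list ends up in reverse insertion order, with the most recently added fragment at the head, which is why a separate sorting pass is still needed before blending.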

Now that we have populated the linked lists, let's learn how to blend and combine them into the resulting image. This can be done in a full-screen pass in the GL03_OIT.frag shader:

  1. Fragment shader inputs contain a texScene texture that holds the rendered image of all the opaque objects. The TransparentFragment structure should match the one from the previous shader. The buffer binding points correspond to the C++ code:

    #version 460 core

    layout(location = 0) in vec2 uv;

    layout(location = 0) out vec4 out_FragColor;

    layout(binding = 0) uniform sampler2D texScene;

    struct TransparentFragment {

      vec4 color;

      float depth;

      uint next;

    };

    layout (binding = 0, r32ui) uniform uimage2D heads;

    layout (std430, binding = 3) buffer Lists {

      TransparentFragment fragments[];

    };

  2. The main() function starts by copying the linked list for this pixel into a local array. This is done in a loop until we reach the 0xFFFFFFFF guard marker or we reach the maximal number of overlapping transparent fragments; that is, MAX_FRAGMENTS:

    void main() {

      #define MAX_FRAGMENTS 64

      TransparentFragment frags[64];

      int numFragments = 0;

      uint idx = imageLoad(    heads, ivec2(gl_FragCoord.xy)).r;

      while (idx != 0xFFFFFFFF &&         numFragments < MAX_FRAGMENTS) {

        frags[numFragments] = fragments[idx];

        numFragments++;

        idx = fragments[idx].next;

      }

  3. Let's sort the array by depth by using insertion sort from largest to smallest. This is fast, considering we have a reasonably small number of overlapping fragments:

      for (int i = 1; i < numFragments; i++) {

        TransparentFragment toInsert = frags[i];

        uint j = i;

        while (j > 0 && toInsert.depth >           frags[j-1].depth) {

          frags[j] = frags[j-1];

          j--;

        }

        frags[j] = toInsert;

      }

  4. Finally, we can blend the fragments together. Get the color of the closest non-transparent object from the framebuffer and traverse the array, blending colors based on their alpha values. Clamping is necessary to prevent any HDR values leaking into the alpha channel:

      vec4 color = texture(texScene, uv);

      for (int i = 0; i < numFragments; i++)

        color = mix(color, vec4(frags[i].color),      clamp(float(frags[i].color.a), 0.0, 1.0));

      out_FragColor = color;

    }
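
The sort-and-blend logic of steps 3 and 4 is easy to check on the CPU. The sketch below is an illustration rather than the demo's code: it takes the fragments of a single pixel as a std::vector, and the Fragment, Color, and resolvePixel() names are hypothetical. It shows that the result does not depend on the order in which fragments were inserted:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Color = std::array<float, 4>; // RGB plus alpha

struct Fragment {
  Color color;
  float depth;
};

// GLSL-style mix(): linear interpolation between x and y.
inline Color mix(const Color& x, const Color& y, float t) {
  Color out{};
  for (size_t i = 0; i < 4; i++) out[i] = x[i] * (1.0f - t) + y[i] * t;
  return out;
}

// Hypothetical CPU version of the resolve pass: insertion sort from the
// largest depth (farthest) to the smallest (closest), then blend the
// fragments over the background color in that order.
inline Color resolvePixel(Color background, std::vector<Fragment> frags) {
  for (size_t i = 1; i < frags.size(); i++) {
    const Fragment toInsert = frags[i];
    size_t j = i;
    while (j > 0 && toInsert.depth > frags[j - 1].depth) {
      frags[j] = frags[j - 1];
      j--;
    }
    frags[j] = toInsert;
  }
  Color color = background;
  for (const Fragment& f : frags) {
    const float alpha = std::min(std::max(f.color[3], 0.0f), 1.0f);
    color = mix(color, f.color, alpha);
  }
  return color;
}
```

As in the shader, mix() is applied to all four channels, so the alpha channel of the output is blended along with the color.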

The demo application should render the following image:

Figure 10.3 – Inside the Lumberyard Bistro bar with order-independent transparency

Make sure you check the bottles and glasses on the tables, windows, and how all these objects overlay each other.

Loading texture assets asynchronously

All our demos up to this point have preloaded all the assets at startup, before anything can be rendered. This is okay for applications where the size of the data is small, and everything can be loaded in an instant. Once our content gets into the territory of gigabytes, a mechanism would be desirable to stream assets as required. Let's extend our demos with some basic lazy-loading functionality to load textures while the application is already rendering a scene. Multithreading will be done using the Taskflow library and standard C++14 capabilities.

Getting ready

We recommend revisiting the Multithreading with Taskflow recipe of Chapter 2, Using Essential Libraries.

The source code for this recipe can be found in Chapter10/GL04_LazyLoading.

How to do it...

To make things simple, the idea behind our approach is to replace the GLSceneData class, which handled all scene loading so far, with another class called GLSceneDataLazy, which is capable of loading textures lazily. As we explained in the Implementing lightweight rendering queues in OpenGL recipe in the previous chapter, our new GLMesh class is a template that's been parameterized with GLSceneDataType and can accept both GLSceneData and GLSceneDataLazy. Let's take a look at the declaration of a new scene loading class in the shared/glFramework/GLSceneDataLazy.h file:

  1. We will store a dummy texture that will be used as a replacement for any texture that hasn't been loaded yet. In our case, it is a white 1x1 image. The textureFiles_ container stores the filenames of all the textures in the scene and is populated at startup in the class constructor:

    class GLSceneDataLazy {

    public:

      const std::shared_ptr<GLTexture> dummyTexture_ =    std::make_shared<GLTexture>(      GL_TEXTURE_2D, "data/const1.bmp");

      std::vector<std::string> textureFiles_;

    The LoadedImageData structure contains all the information that's required to load an RGBA8 uncompressed texture. The index_ field stores an index of the loaded texture inside allMaterialTextures_. The other members are the width, height, and the actual RGBA8 pixel data. Once the loading thread has finished loading a texture from the file, it creates another instance of LoadedImageData and appends it to the loadedFiles_ container, which is guarded by a loadedFilesMutex_ mutex:

      struct LoadedImageData {

        int index_ = 0;

        int w_ = 0;

        int h_ = 0;

        const uint8_t* img_ = nullptr;

      };

      std::vector<LoadedImageData> loadedFiles_;

      std::mutex loadedFilesMutex_;

      std::vector<std::shared_ptr<GLTexture>>    allMaterialTextures_;

    The geometry and scene graph data is stored intact, similar to how it is done in the original GLSceneData class:

      MeshFileHeader header_;

      MeshData meshData_;

      Scene scene_;

      std::vector<DrawData> shapes_;

  2. When it comes to materials, we must store two copies:

      // materials loaded from the scene file

      std::vector<MaterialDescription> materialsLoaded_;

      // materials uploaded into GPU buffers

      std::vector<MaterialDescription> materials_;

    The reason for this is that our original code in GLSceneData converts the integer texture indices inside MaterialDescription into OpenGL bindless texture handles. This transformation is destructive: once we have a bindless handle, we cannot convert it back into a texture index. To simplify our loading code, we must upload an entirely new copy of the materials into the GPU every time a new texture is loaded. Hence, we require the original indices of textures, which are stored intact in the materialsLoaded_ vector.

  3. Our code for the Taskflow library needs the Executor and Taskflow objects:

      tf::Taskflow taskflow_;

      tf::Executor executor_;

    We will store an instance of Executor right here for the sake of simplicity. However, if there are many instances of GLSceneDataLazy, an external Executor shared by all the scene data instances should be preferred.

  4. The uploadLoadedTextures() method is polled from the main loop. It uploads any new textures that were loaded since the last frame to the GPU, and then calls updateMaterials() to update material descriptions and replace dummy textures with newly loaded ones:

      bool uploadLoadedTextures();

      GLSceneDataLazy(    const char* meshFile,    const char* sceneFile,    const char* materialFile);

    private:

      void updateMaterials();

      void loadScene(const char* sceneFile);

    };

The implementation of the GLSceneDataLazy class is located in the shared/glFramework/GLSceneDataLazy.h file and is pretty straightforward:

  1. There is a static helper function getTextureHandleBindless() that is used to convert a texture index into an OpenGL bindless texture handle:

    static uint64_t getTextureHandleBindless(uint64_t idx,

      const std::vector<shared_ptr<GLTexture>>& textures) {

      if (idx == INVALID_TEXTURE) return 0;

      return textures[idx]->getHandleBindless();

    }

  2. The class constructor loads the mesh, scene graph, and materials data exactly as it was done earlier in the GLSceneData class:

    GLSceneDataLazy::GLSceneDataLazy(  const char* meshFile,  const char* sceneFile,  const char* materialFile)

    {

      header_ = loadMeshData(meshFile, meshData_);

      loadScene(sceneFile);

      loadMaterials(    materialFile, materialsLoaded_, textureFiles_);

    Instead of loading textures from files right here, we will replace each texture with a dummy and update the materials data using these textures. This way, the entire scene will be rendered with white textures applied.

  3. Let's reserve some space in the loadedFiles_ container to make sure no reallocation will be done while we asynchronously append items to it later:

      for (const auto& f: textureFiles_)

        allMaterialTextures_.emplace_back(dummyTexture_);

      updateMaterials();

      loadedFiles_.reserve(textureFiles_.size());

  4. Now, we can initialize the taskflow_ object using the for_each_index() algorithm. It will create an independent asynchronous task for each texture file using the provided lambda. This lambda loads a texture from a file using the STB library and stores the loaded data in loadedFiles_. Once the taskflow_ object has been initialized, we can run it using Executor:

      taskflow_.for_each_index(0u,    (uint32_t)textureFiles_.size(), 1,    [this](int idx) {

          int w, h;

          const uint8_t* img = stbi_load(        this->textureFiles_[idx].c_str(),        &w, &h, nullptr, STBI_rgb_alpha);

          std::lock_guard lock(loadedFilesMutex_);

          loadedFiles_.emplace_back(        LoadedImageData{ idx, w, h, img });

        }

      );

      executor_.run(taskflow_);

    }

  5. The uploadLoadedTextures() method is polled every frame from the main loop. It locks the loadedFilesMutex_ mutex and checks if any new texture was loaded. The explicit block scope here limits the amount of work done while the mutex is locked, so the texture creation itself happens outside the lock. Once new image data has been retrieved, we can create a new OpenGL texture and update the materials accordingly. Note that each call uploads at most one texture:

    bool GLSceneDataLazy::uploadLoadedTextures()

    {

      LoadedImageData data;

      {

         std::lock_guard lock(loadedFilesMutex_);

         if (loadedFiles_.empty()) return false;

         data = loadedFiles_.back();

         loadedFiles_.pop_back();

      }

      allMaterialTextures_[data.index_] =    std::make_shared<GLTexture>(      data.w_, data.h_, data.img_);

      stbi_image_free((void*)data.img_);

      updateMaterials();

      return true;

    }

  6. The material updating code is somewhat similar to the GLSceneData implementation, with the exception that this variant is non-destructive. It sources the material data from materialsLoaded_ and converts all texture indices into bindless handles:

    void GLSceneDataLazy::updateMaterials() {

      const size_t numMaterials = materialsLoaded_.size();

      materials_.resize(numMaterials);

      for (size_t i = 0; i != numMaterials; i++) {

        const auto& in = materialsLoaded_[i];

        auto& out = materials_[i];

        out = in;

        out.ambientOcclusionMap_ =      getTextureHandleBindless(        in.ambientOcclusionMap_,        allMaterialTextures_);

        out.emissiveMap_ = getTextureHandleBindless(      in.emissiveMap_, allMaterialTextures_);

        out.albedoMap_ = getTextureHandleBindless(      in.albedoMap_, allMaterialTextures_);

        out.metallicRoughnessMap_ =      getTextureHandleBindless(        in.metallicRoughnessMap_,        allMaterialTextures_);

        out.normalMap_ = getTextureHandleBindless(      in.normalMap_, allMaterialTextures_);

      }

    }

  7. Last but not least, the loadScene() function loads the scene graph data and converts it into a shapes list exactly as was done in the GLSceneData class:

    void GLSceneDataLazy::loadScene(const char* sceneFile) {

      ::loadScene(sceneFile, scene_);

      for (const auto& c: scene_.meshes_) {

        auto material = scene_.materialForNode_.find(c.first);

        if (material != scene_.materialForNode_.end())

          shapes_.push_back(DrawData{

            .meshIndex = c.second,

            .materialIndex = material->second,

            .LOD = 0,

            .indexOffset = meshData_.meshes_[c.second].indexOffset,

            .vertexOffset = meshData_.meshes_[c.second].vertexOffset,

            .transformIndex = c.first});

      }

      markAsChanged(scene_, 0);

      recalculateGlobalTransforms(scene_);

    }
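
The handoff between the loading tasks and the polling method in the steps above can be sketched using only the C++ standard library. This is a minimal illustration, not the GLSceneDataLazy code: LoadedItem, LoadedQueue, and loadAndDrain() are hypothetical names, std::async stands in for Taskflow, and no real image decoding or OpenGL upload takes place.

```cpp
#include <cassert>
#include <future>
#include <mutex>
#include <vector>

// Hypothetical stand-in for LoadedImageData: a texture slot and a payload size.
struct LoadedItem {
  int index;
  int bytes;
};

// Hypothetical queue; the recipe keeps the vector and the mutex as class members.
class LoadedQueue {
public:
  // Called from worker threads once a file has been decoded.
  void push(const LoadedItem& item) {
    std::lock_guard<std::mutex> lock(mutex_);
    items_.push_back(item);
  }
  // Called from the main thread once per frame. It only copies one item
  // under the lock, so the expensive upload can happen after it returns.
  bool tryPop(LoadedItem& out) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (items_.empty()) return false;
    out = items_.back();
    items_.pop_back();
    return true;
  }
private:
  std::mutex mutex_;
  std::vector<LoadedItem> items_;
};

// Simulates loading 'count' files on worker threads and then drains the
// queue the way a main loop would. Returns the number of filled slots.
inline int loadAndDrain(int count) {
  LoadedQueue queue;
  std::vector<std::future<void>> workers;
  for (int i = 0; i < count; i++)
    workers.push_back(std::async(std::launch::async,
      [&queue, i] { queue.push(LoadedItem{ i, 4 * (i + 1) }); }));
  // A real application keeps rendering here instead of waiting.
  for (auto& w : workers) w.get();

  std::vector<bool> filled(count, false);
  int uploaded = 0;
  LoadedItem item;
  while (queue.tryPop(item)) {
    assert(item.index >= 0 && item.index < count);
    filled[item.index] = true;  // stands in for replacing a dummy texture
    uploaded++;
  }
  return uploaded;
}
```

Items may arrive in any order, which is why each one carries the index of the slot it belongs to.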

All the support code is now complete and ready to be tested. Let's take a look at the Chapter10/GL04_LazyLoading demo application to see how the Bistro scene can be rendered with lazy-loaded textures.

The application is based on the code in the previous Implementing order-independent transparency recipe, and will render all the transparent objects exactly as in the previous demo. Here, we will only highlight the differences necessary to introduce lazy-loading:

  1. Inside the main() function, we should use the GLSceneDataLazy class instead of GLSceneData:

    int main(void) {

      ...

      GLSceneDataLazy sceneData(    "data/meshes/bistro_all.meshes",    "data/meshes/bistro_all.scene",    "data/meshes/bistro_all.materials");

      GLMesh mesh(sceneData);

      ...

  2. Inside the main loop, we should poll uploadLoadedTextures() and, in case there are newly loaded textures, upload a new materials buffer into an appropriate OpenGL buffer:

      while (!glfwWindowShouldClose(app.getWindow()))  {

        if (sceneData.uploadLoadedTextures())

          mesh.updateMaterialsBuffer(sceneData);

        ... do all the rendering here         exactly as it was done before ...

      }

      return 0;

    }

And that is it. Just by changing a few lines of code, our previous demo gains asynchronous texture loading functionality. If you run the demo, the geometry data will be loaded and rendered almost instantly, producing an image that's similar to the following:

Figure 10.4 – Lumberyard Bistro scene rendered with dummy textures before asynchronous loading kicks in

You can navigate the app with a keyboard and mouse while all the textures are streamed in within moments. This functionality, besides being essential to any serious rendering engine, is also very handy for smaller-scale applications, where being able to quickly rerun the app makes debugging easier.

There's more...

Loading textures asynchronously gives much better responsiveness than preloading the entire texture pool at startup. However, a better approach may be to maintain persistently mapped buffers for texture data and load images directly into these memory regions. The best performance can be achieved if the texture data in files has already been compressed into GPU runtime formats. This might be an interesting exercise for you. Texture compression can be added to the SceneConverter tool and additional loading code can be accommodated in the GLSceneDataLazy class.

Another improvement could be made to the uploadLoadedTextures() method. Instead of polling each GLSceneDataLazy object, texture updates might become a part of the events system in your rendering engine. One more thing to mention is that instead of uploading only one texture on each iteration, a load balancer could be implemented that uploads multiple textures into the GPU per frame, choosing their number from the texture sizes through some heuristics. We recommend running the scene converter tool to output 1,024x1,024 textures so that you can experiment with this approach on a much heavier dataset.
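
One possible heuristic for such a load balancer is a per-frame byte budget. The following is a minimal sketch under that assumption; PendingUpload, rgba8Bytes(), and takeBatch() are hypothetical names that are not part of the book's framework, and the budget value is something you would tune for your hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pending-upload record: texture slot plus decoded image size.
struct PendingUpload {
  int index;
  int w;
  int h;
};

// Size of an uncompressed RGBA8 image in bytes.
inline std::size_t rgba8Bytes(const PendingUpload& p) {
  assert(p.w >= 0 && p.h >= 0);
  return static_cast<std::size_t>(p.w) * static_cast<std::size_t>(p.h) * 4u;
}

// Removes uploads from the back of 'pending' until the per-frame byte budget
// is used up. At least one item is always taken, so a texture larger than
// the budget cannot stall the queue forever.
inline std::vector<PendingUpload> takeBatch(
    std::vector<PendingUpload>& pending, std::size_t budgetBytes) {
  std::vector<PendingUpload> batch;
  std::size_t used = 0;
  while (!pending.empty()) {
    const std::size_t cost = rgba8Bytes(pending.back());
    if (!batch.empty() && used + cost > budgetBytes) break;
    used += cost;
    batch.push_back(pending.back());
    pending.pop_back();
  }
  return batch;
}
```

With this in place, the materials buffer would only need to be rebuilt once per batch rather than once per texture.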

Implementing projective shadows for directional lights

Earlier in this book, we learned how to set up a shadow mapping pipeline with OpenGL and Vulkan. Our focus was more on the API than on shadow correctness or versatility. In the Implementing shadow maps in OpenGL recipe of Chapter 8, Image-Based Techniques, we learned how to render shadows from spotlights using perspective projection for our shadow-casting light source. Directional light sources typically simulate sunlight, and a single light source can illuminate the entire scene. In its simplest form, this means we should construct a projection matrix that takes the bounds of the scene into account. Let's take a look at how to implement this basic approach and add some shadows to our Bistro rendering example.

Getting ready

Make sure you revisit the Implementing shadow maps in OpenGL recipe of Chapter 8, Image-Based Techniques, before proceeding with this recipe.

The calculations described in this recipe are part of the Chapter10/GL05_Final OpenGL demo.

How to do it...

To construct a projection matrix for a directional light, we must calculate the axis-aligned bounding box of the entire scene, transform it into light-space using the light's view matrix, and use the bounds of the transformed box to construct an orthographic frustum that entirely encloses this box.

The scene bounding box can be calculated in two simple steps:

  1. First, we should pretransform the bounding boxes of individual meshes into world space. The same code was used earlier for frustum culling in the Doing frustum culling on the GPU with compute shaders recipe:

    for (const auto& c : sceneData.shapes_)     {

      mat4 model = sceneData.scene_.    globalTransform_[c.transformIndex];

      reorderedBoxes.push_back(    sceneData.meshData_.boxes_[c.meshIndex]);

      reorderedBoxes.back().transform(model);

    }

  2. Then, we can combine all the transformed bounding boxes into one big box:

    BoundingBox bigBox = reorderedBoxes.front();

    for (const auto& b : reorderedBoxes) {

      bigBox.combinePoint(b.min_);

      bigBox.combinePoint(b.max_);

    }

Since our scene is static, we need to do these calculations only once, outside the main loop.
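
The box union from step 2 can be tried in isolation with the following sketch. Vec3 and Box are minimal stand-ins written for this example and are not the book's BoundingBox class; only the combinePoint() idea is carried over.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };

// Minimal stand-in for a bounding box; only what the union needs.
struct Box {
  Vec3 min_, max_;
  // Grows the box so that it contains point p.
  void combinePoint(const Vec3& p) {
    min_ = { std::min(min_.x, p.x), std::min(min_.y, p.y), std::min(min_.z, p.z) };
    max_ = { std::max(max_.x, p.x), std::max(max_.y, p.y), std::max(max_.z, p.z) };
  }
};

// Union of all boxes. Seeding from the first box avoids sentinel values,
// which is why the input must not be empty.
inline Box combineBoxes(const std::vector<Box>& boxes) {
  assert(!boxes.empty());
  Box big = boxes.front();
  for (const Box& b : boxes) {
    big.combinePoint(b.min_);
    big.combinePoint(b.max_);
  }
  return big;
}
```

Combining only the min and max corners is sufficient here because each input box is already axis-aligned in world space.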

Now, we can construct a projection matrix for our directional light source. To make things more interactive, the light source direction can be controlled from ImGui using two angles, g_LightTheta and g_LightPhi:

  1. Let's construct two rotation matrices using these two angles and use them to rotate a vertical top-down light direction, (0, -1, 0). The view matrix for our light is constructed using the resulting light direction vector. As the light covers the entire scene, we do not really care about its origin point, so we can put it at (0, 0, 0):

    const glm::mat4 rot1 = glm::rotate(mat4(1.f),  glm::radians(g_LightTheta), vec3(0, 0, 1));

    const glm::mat4 rot2 = glm::rotate(  rot1, glm::radians(g_LightPhi), vec3(1, 0, 0));

    const vec3 lightDir = glm::normalize(  vec3(rot2 * vec4(0.0f, -1.0f, 0.0f, 1.0f)));

    const mat4 lightView = glm::lookAt(  vec3(0), lightDir, vec3(0, 0, 1));

  2. Now, we can use this light's view matrix to transform the world-space bounding box of the scene into light-space. OpenGL's NDC space has its Z-axis facing forward, away from the camera, so we should negate and swap the Z coordinates here:

    const BoundingBox box =  bigBox.getTransformed(lightView);

    const mat4 lightProj = glm::ortho(   box.min_.x,  box.max_.x,   box.min_.y,  box.max_.y,  -box.max_.z, -box.min_.z);

  3. The light's view and projection matrices are passed into GLSL shaders via our traditional PerFrameData structure:

    PerFrameData perFrameData = {  .view = view,  .proj = proj,  .light = lightProj * lightView,  .cameraPos = glm::vec4(camera.getPosition(), 1.0f),  .enableCulling = g_EnableGPUCulling ? 1u : 0u };

The remaining shadow map rendering logic remains exactly the same as in our previous spotlight example. Here's a screenshot from the running demo showing what the shadow map for the entire scene should look like, with the g_LightTheta and g_LightPhi angles set to 0:

Figure 10.5 – A shadow map for top-down directional light

Note how the light's viewing frustum is fit to the scene bounds.
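
To see why the Z bounds are negated and swapped, the fitting step can be reproduced without GLM. The sketch below is an illustration under stated assumptions: Vec3, OrthoBounds, and fitOrtho() are hypothetical names, the light sits at the origin, and the basis is built the way a right-handed look-at matrix builds it, so the light looks down its local negative Z-axis.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

struct Vec3 { float x, y, z; };

inline Vec3 cross(Vec3 a, Vec3 b) {
  return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 normalize(Vec3 v) {
  const float len = std::sqrt(dot(v, v));
  assert(len > 0.0f);  // fails if lightDir is parallel to the up vector
  return { v.x / len, v.y / len, v.z / len };
}

// Parameters in the order an orthographic projection function takes them.
struct OrthoBounds { float left, right, bottom, top, zNear, zFar; };

// Fits an orthographic volume around a world-space AABB as seen by a light
// placed at the origin and looking along lightDir.
inline OrthoBounds fitOrtho(Vec3 boxMin, Vec3 boxMax, Vec3 lightDir, Vec3 up) {
  const Vec3 f = normalize(lightDir);      // forward
  const Vec3 s = normalize(cross(f, up));  // side
  const Vec3 u = cross(s, f);              // corrected up
  const float big = std::numeric_limits<float>::max();
  Vec3 lo = { big, big, big };
  Vec3 hi = { -big, -big, -big };
  for (int i = 0; i < 8; i++) {
    const Vec3 p = { (i & 1) ? boxMax.x : boxMin.x,
                     (i & 2) ? boxMax.y : boxMin.y,
                     (i & 4) ? boxMax.z : boxMin.z };
    // Light-space position; Z is negated because the light looks down -Z.
    const Vec3 q = { dot(s, p), dot(u, p), -dot(f, p) };
    lo = { std::min(lo.x, q.x), std::min(lo.y, q.y), std::min(lo.z, q.z) };
    hi = { std::max(hi.x, q.x), std::max(hi.y, q.y), std::max(hi.z, q.z) };
  }
  // Near and far are distances along the viewing direction, so the largest
  // light-space Z becomes the near plane and the smallest the far plane.
  return { lo.x, hi.x, lo.y, hi.y, -hi.z, -lo.z };
}
```

For a top-down light and a scene located above the origin, the resulting near value is negative. This is fine for an orthographic projection and is the reason the light's origin does not matter.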

There's more...

This type of projection for directional shadow maps covers the entire scene, which results in significant aliasing because the shadow map resolution is spread over the whole scene. Instead of using this orthographic projection as is, we can warp it in such a way that parts of the scene closer to the camera occupy a larger area of the shadow map. One of the first algorithms of this kind was Perspective Shadow Maps (PSMs), presented at SIGGRAPH 2002 by Stamminger and Drettakis. It uses post-projective space, where all nearby objects become larger than farther ones. A more recent algorithm is Light Space Perspective Shadow Maps (https://www.cg.tuwien.ac.at/research/vr/lispsm).

A more generic technique to reduce shadow map aliasing is to split one big light frustum into multiple frustums, or cascades, which cover different distance ranges from the camera, render a shadow map for each of them, and select which one to apply based on the distance from the viewer. Search for the Parallel Split Shadow Maps and Cascaded Shadow Maps algorithms for more information.

Furthermore, PSMs and cascades can be combined into a single shadow mapping pipeline to produce the best results.
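
As a starting point for experimenting with cascades, here is a minimal sketch of how split distances are commonly chosen: a blend between a uniform and a logarithmic distribution, as described in the Parallel Split Shadow Maps literature. The cascadeSplits() function and its lambda parameter are names chosen for this example and are not part of the demo code.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Returns numCascades + 1 split distances from zNear to zFar.
// lambda = 0 gives uniform splits, lambda = 1 gives logarithmic splits,
// and values in between blend the two.
inline std::vector<float> cascadeSplits(
    float zNear, float zFar, int numCascades, float lambda) {
  assert(zNear > 0.0f && zFar > zNear && numCascades > 0);
  std::vector<float> splits(numCascades + 1);
  for (int i = 0; i <= numCascades; i++) {
    const float t = float(i) / float(numCascades);
    const float logSplit = zNear * std::pow(zFar / zNear, t);
    const float uniSplit = zNear + (zFar - zNear) * t;
    splits[i] = lambda * logSplit + (1.0f - lambda) * uniSplit;
  }
  return splits;
}
```

Each pair of neighboring distances bounds one slice of the camera frustum. A separate light projection can then be fit to each slice in the same way this recipe fits one to the whole scene.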

Now, we are ready to combine multiple effects and techniques into a single application. But before we do that, let's switch back to Vulkan and learn how to use atomics to visualize the rendering order of fragments.

Using GPU atomic counters in Vulkan

The goal of this recipe is to introduce atomics in Vulkan and demonstrate how to use them to see how the GPU scheduler distributes the fragment shader workload. In a sense, this is the order in which the GPU rasterizes triangles into fragments on the screen. Let's learn how to implement a Vulkan application to visualize the rendering order of fragments in a full-screen quad consisting of two triangles.

Getting ready

The demo from Implementing order-independent transparency uses an atomic counter to maintain per-pixel linked lists of transparent fragments. Make sure you read that recipe. Please recall our Vulkan scene data structures by reading the Working with shapes lists in Vulkan recipe of Chapter 9, Working with Scene Graphs.

The source code for this recipe can be found in Chapter10/VK01_AtomicsTest.

How to do it...

The application runs using just two render passes. The first pass renders a full-screen quad and, at each fragment shader's invocation, saves its coordinates by atomically appending them to an SSBO buffer. The second pass takes point coordinates from the list and outputs these points onto the screen. The user can control the percentage of displayed points. The visual randomness of the points reflects the order in which the GPU schedules the fragment workload. It is worth comparing the output of this demo across various GPUs of different vendors.

Let's take a look at the GLSL shaders. The vertex shader to render a full-screen quad was described and used extensively in this book. The real work in the first pass is done in the fragment shader:

  1. The fragment shader fills the value buffer with fragment coordinates. To keep track of the number of added fragments, we can use the count buffer, which holds an integer atomic counter. To enable atomic buffer access instructions, we can set the GLSL version to 4.6. The value output buffer contains a fragment index and its screen coordinates. The ubo uniform buffer contains the framebuffer dimensions for calculating the number of output pixels:

    #version 460 core

    layout(set = 0, binding = 0) buffer Atomic   { uint count; };

    struct node { uint idx; float xx, yy; };

    layout (set = 0, binding = 1) buffer Values   { node value[]; };

    layout (set = 0, binding = 2) uniform UniformBuffer   { float width; float height; } ubo;

    layout(location = 0) out vec4 outColor;

  2. The main function calculates the total number of screen pixels and compares our current counter with the number of pixels. The gl_HelperInvocation check helps us avoid touching the fragments list for any helper invocations of the fragment shader. If there is still space in the buffer, we increment the counter atomically and write out a new fragment. As we are not actually outputting anything to the frame buffer, we can use discard after updating the list:

    void main() {

      const uint maxPixels =    uint(ubo.width) * uint(ubo.height);

      if (count < maxPixels &&      gl_HelperInvocation == false) {

        uint idx = atomicAdd(count, 1);

        // store the fragment's index and its screen coordinates

        value[idx].idx = idx;

        value[idx].xx  = gl_FragCoord.x;

        value[idx].yy  = gl_FragCoord.y;

      }

      discard;

    }
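
The append pattern used by this fragment shader has a direct CPU analog, which can be handy for reasoning about it. The following sketch uses std::atomic and std::thread instead of GLSL; AppendBuffer and fillFromThreads() are hypothetical names. One deliberate difference, which is an assumption of this sketch and not the shader's logic: the capacity check is done on the index returned by the atomic increment, so the write stays in bounds even when several threads race for the last slots.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

struct Node { uint32_t idx; float x, y; };

// Fixed-capacity append buffer. fetch_add returns the value held before the
// increment, like GLSL atomicAdd(), so every caller gets a unique slot.
class AppendBuffer {
public:
  explicit AppendBuffer(uint32_t capacity) : count_(0), nodes_(capacity) {}
  bool append(float x, float y) {
    const uint32_t idx = count_.fetch_add(1u, std::memory_order_relaxed);
    if (idx >= nodes_.size()) return false;  // buffer is full
    nodes_[idx] = Node{ idx, x, y };
    return true;
  }
  // Number of valid nodes; the raw counter may overshoot the capacity.
  uint32_t size() const {
    const uint32_t n = count_.load();
    return n < nodes_.size() ? n : uint32_t(nodes_.size());
  }
  const Node& at(uint32_t i) const { return nodes_[i]; }
private:
  std::atomic<uint32_t> count_;
  std::vector<Node> nodes_;
};

// Fills the buffer from several threads, one "fragment" per pixel.
// Thread t handles rows t, t + numThreads, t + 2 * numThreads, and so on.
inline void fillFromThreads(AppendBuffer& buf, int width, int height, int numThreads) {
  std::vector<std::thread> threads;
  for (int t = 0; t < numThreads; t++)
    threads.emplace_back([&buf, width, height, numThreads, t] {
      for (int y = t; y < height; y += numThreads)
        for (int x = 0; x < width; x++)
          buf.append(x + 0.5f, y + 0.5f);
    });
  for (auto& th : threads) th.join();
}
```

The order of the nodes in the buffer differs from run to run, just as the fragment order depends on how the GPU schedules its work.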

Let's take a look at the C++ part:

  1. The application class creates three Renderers – the GUI, the atomic buffer filler and the point list renderer – in the constructor:

    struct MyApp: public CameraApp {

      MyApp()

      : CameraApp(-80, -80, {     .vertexPipelineStoresAndAtomics_ = true,     .fragmentStoresAndAtomics_ = true })

      , sizeBuffer(ctx_.resources.addUniformBuffer(8))

      , atom(ctx_, sizeBuffer)

      , anim(ctx_, atom.getOutputs(), sizeBuffer)

      , imgui(ctx_, std::vector<VulkanTexture>{})

      {

         onScreenRenderers_.emplace_back(atom, false);

         onScreenRenderers_.emplace_back(anim, false);

         onScreenRenderers_.emplace_back(imgui, false);

  2. The framebuffer dimensions are uploaded to the uniform buffer:

         struct WH {

            float w, h;

         } wh {        (float)ctx_.vkDev.framebufferWidth,        (float)ctx_.vkDev.framebufferHeight      };

         uploadBufferData(ctx_.vkDev, sizeBuffer.memory,       0, &wh, sizeof(wh));

      }

  3. The only thing we must do to render the frame is add a Settings window, with the slider configuring the percentage of rendered points:

      void draw3D() override {}

      void drawUI() override {

        ImGui::Begin("Settings", nullptr);

        ImGui::SliderFloat("Percentage",      &g_Percentage, 0.0f, 1.0f);

        ImGui::End();

      }

    private:

      VulkanBuffer sizeBuffer;

      AtomicRenderer atom;

      AnimRenderer anim;

      GuiRenderer imgui;

    };

The visualization part is handled by another pair of GLSL shaders, VK01_AtomicVisualize.frag and VK01_AtomicVisualize.vert:

  1. The VK01_AtomicVisualize.vert vertex shader goes through the list and calculates the position and color of the fragment. Darker colors correspond to the earliest fragments:

    #version 460 core

    struct node { uint idx; float xx, yy; };

    layout (set = 0, binding = 0) buffer Values   { node value[]; };

    layout (set = 0, binding = 1) uniform UniformBuffer   { float width; float height; } ubo;

    layout(location = 0) out vec4 color;

    void main() {

      node n = value[gl_VertexIndex];

      gl_Position = vec4(2 * (vec2(    n.xx / ubo.width, n.yy / ubo.height) - vec2(.5)),    0, 1);

      gl_PointSize = 1;

      color = vec4(    (float(n.idx)/ubo.width) / ubo.height, 0, 0, 1);

    }

  2. The fragment shader outputs the color to the framebuffer:

    #version 460 core

    layout(location = 0) in vec4 color;

    layout(location = 0) out vec4 outColor;

    void main() {

      outColor = color;

    }
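
The arithmetic in the vertex shader is easy to verify on the CPU. The sketch below mirrors the position and color calculations in C++. Vec2, toNDC(), and orderToRed() are names introduced for this example, and pointsToDraw() is only an assumption about how a slider value could be turned into a point count; the renderer in the demo may do this differently.

```cpp
#include <cassert>
#include <cstdint>

struct Vec2 { float x, y; };

// Maps window-space fragment coordinates to normalized device coordinates,
// mirroring 2 * (vec2(xx / width, yy / height) - vec2(0.5)) in the shader.
inline Vec2 toNDC(float fragX, float fragY, float width, float height) {
  assert(width > 0.0f && height > 0.0f);
  return { 2.0f * (fragX / width - 0.5f), 2.0f * (fragY / height - 0.5f) };
}

// Red channel used by the shader: the fragment's index divided by the
// total number of pixels, so early fragments are dark and late ones bright.
inline float orderToRed(uint32_t idx, float width, float height) {
  return (float(idx) / width) / height;
}

// Hypothetical helper: number of points to draw for a slider value,
// clamped to the [0, 1] range.
inline uint32_t pointsToDraw(float percentage, uint32_t width, uint32_t height) {
  if (percentage < 0.0f) percentage = 0.0f;
  if (percentage > 1.0f) percentage = 1.0f;
  return uint32_t(double(percentage) * double(width) * double(height));
}
```

Because the fragment coordinates are already in pixels, the only inputs needed are the framebuffer dimensions stored in the uniform buffer.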

Let's run the application. By manually dragging the slider from 0 to 1, we can slowly observe the order in which the GPU rasterizes these two triangles, as shown in the following screenshot:

Figure 10.6 – Visualizing the order of fragments and their rasterization using Vulkan atomics

There's more…

Atomics can be used as building blocks for various GPU-based data structures. Order-independent transparency was another example of their use in this book.

This recipe was inspired by the following good old OpenGL demo: https://www.geeks3d.com/20120309/opengl-4-2-atomic-counter-demo-rendering-order-of-fragments.

Putting it all together into an OpenGL demo

To conclude our OpenGL rendering engine, let's combine the techniques from Chapter 8, Image-Based Techniques, Chapter 9, Working with Scene Graphs, and Chapter 10, Advanced Rendering Techniques and Optimizations, into a single application.

Our final OpenGL demo application renders the Lumberyard Bistro scene with the following techniques and effects:

  • Screen-space ambient occlusion
  • HDR rendering with light adaptation
  • Directional shadow mapping with percentage closer filtering
  • GPU frustum culling using compute shaders
  • Order-independent transparency
  • Asynchronously loading textures
  • OpenGL bindless textures

Getting ready

Make sure you have a good grasp of all the recipes from this and the previous chapters.

The demo application for this recipe can be found in the Chapter10/GL05_Final folder.

How to do it...

We are already familiar with all the rendering techniques that will be used in this recipe, so let's quickly go through the C++ source code and highlight some parts that enable the interoperability of these techniques:

  1. The PerFrameData structure combines everything we have into one place:

    struct PerFrameData {

      mat4 view;

      mat4 proj;

      mat4 light;

      vec4 cameraPos;

      vec4 frustumPlanes[6];

      vec4 frustumCorners[8];

      uint32_t enableCulling;

    };

  2. Tweak the SSAO parameters for this specific scene. The structure layout remains the same so that we can work with the original GLSL shaders from the Implementing SSAO in OpenGL recipe:

    struct SSAOParams {

      float scale_ = 1.5f;

      float bias_ = 0.15f;

      float zNear = 0.1f;

      float zFar = 1000.0f;

      float radius = 0.05f;

      float attScale = 1.01f;

      float distScale = 0.6f;

    } g_SSAOParams;

    static_assert(sizeof(SSAOParams) <=  sizeof(PerFrameData));

  3. The same applies to HDR and light adaptation, as described in the Implementing HDR rendering and tone mapping and Implementing HDR light adaptation recipes of Chapter 8, Image-Based Techniques:

    struct HDRParams {

      float exposure_ = 0.9f;

      float maxWhite_ = 1.17f;

      float bloomStrength_ = 1.1f;

      float adaptationSpeed_ = 0.1f;

    } g_HDRParams;

    static_assert(  sizeof(HDRParams) <= sizeof(PerFrameData));

  4. There are a bunch of values that are controlled by ImGui:

    mat4 g_CullingView = camera.getViewMatrix();

    bool g_EnableGPUCulling = true;

    bool g_FreezeCullingView = false;

    bool g_DrawOpaque = true;

    bool g_DrawTransparent = true;

    bool g_DrawGrid = true;

    bool g_EnableSSAO = true;

    bool g_EnableBlur = true;

    bool g_EnableHDR = true;

    bool g_DrawBoxes = false;

    bool g_EnableShadows = true;

    bool g_ShowLightFrustum = false;

    float g_LightTheta = 0.0f;

    float g_LightPhi = 0.0f;

  5. The main() function starts by loading all the shaders we need:

    int main(void) {

      GLApp app;

      // grid

      GLShader shdGridVert(    "data/shaders/chapter05/GL01_grid.vert");

      GLShader shdGridFrag(    "data/shaders/chapter05/GL01_grid.frag");

      GLProgram progGrid(shdGridVert, shdGridFrag);

      // scene

      GLShader shaderVert(    "data/shaders/chapter10/GL01_scene_IBL.vert");

      GLShader shaderFrag(    "data/shaders/chapter10/GL01_scene_IBL.frag");

      GLProgram program(shaderVert, shaderFrag);

      // generic postprocessing

      GLShader shdFullScreenQuadVert(   "data/shaders/chapter08/GL02_FullScreenQuad.vert");

      // OIT

      GLShader shaderFragOIT(    "data/shaders/chapter10/GL03_mesh_oit.frag");

      GLProgram programOIT(shaderVert, shaderFragOIT);

      GLShader shdCombineOIT(    "data/shaders/chapter10/GL03_OIT.frag");

      GLProgram progCombineOIT(    shdFullScreenQuadVert, shdCombineOIT);

      // GPU culling

      GLShader shaderCulling(   "data/shaders/chapter10/GL02_FrustumCulling.comp");

      GLProgram programCulling(shaderCulling);

      // SSAO

      GLShader shdSSAOFrag(    "data/shaders/chapter08/GL02_SSAO.frag");

      GLShader shdCombineSSAOFrag(    "data/shaders/chapter08/GL02_SSAO_combine.frag");

      GLProgram progSSAO(    shdFullScreenQuadVert, shdSSAOFrag);

      GLProgram progCombineSSAO(    shdFullScreenQuadVert, shdCombineSSAOFrag);

      // blur

      GLShader shdBlurXFrag(    "data/shaders/chapter08/GL02_BlurX.frag");

      GLShader shdBlurYFrag(    "data/shaders/chapter08/GL02_BlurY.frag");

      GLProgram progBlurX(    shdFullScreenQuadVert, shdBlurXFrag);

      GLProgram progBlurY(    shdFullScreenQuadVert, shdBlurYFrag);

      // HDR

      GLShader shdCombineHDR(    "data/shaders/chapter08/GL03_HDR.frag");

      GLProgram progCombineHDR(    shdFullScreenQuadVert, shdCombineHDR);

      GLShader shdToLuminance(    "data/shaders/chapter08/GL03_ToLuminance.frag");

      GLProgram progToLuminance(    shdFullScreenQuadVert, shdToLuminance);

      GLShader shdBrightPass(    "data/shaders/chapter08/GL03_BrightPass.frag");

      GLProgram progBrightPass(    shdFullScreenQuadVert, shdBrightPass);

      GLShader shdAdaptation(    "data/shaders/chapter08/GL03_Adaptation.comp");

      GLProgram progAdaptation(shdAdaptation);

      // shadows

      GLShader shdShadowVert(    "data/shaders/chapter10/GL05_shadow.vert");

      GLShader shdShadowFrag(

        "data/shaders/chapter10/GL05_shadow.frag");

      GLProgram progShadowMap(    shdShadowVert, shdShadowFrag);

  6. To make sure different techniques work together, we should be careful in terms of the binding points for our buffers:

      const GLsizeiptr kBoundingBoxesBufferSize =    sizeof(BoundingBox) * kMaxNumObjects;

      const GLuint kBufferIndex_BoundingBoxes =    kBufferIndex_PerFrameUniforms + 1;

      const GLuint kBufferIndex_DrawCommands =    kBufferIndex_PerFrameUniforms + 2;

      const GLuint kBufferIndex_NumVisibleMeshes =    kBufferIndex_PerFrameUniforms + 3;

  7. The actual buffers' declaration code is skipped for brevity. Now, let's make sure we lazy load the scene data and reuse the skybox, canvas, and UI rendering code. The rotation pattern texture is used for SSAO:

      GLSceneDataLazy sceneData(    "data/meshes/bistro_all.meshes",    "data/meshes/bistro_all.scene",    "data/meshes/bistro_all.materials");

      GLMesh mesh(sceneData);

      GLSkyboxRenderer skybox;

      ImGuiGLRenderer rendererUI;

      CanvasGL canvas;

      GLTexture rotationPattern(    GL_TEXTURE_2D, "data/rot_texture.bmp");

  8. The OIT setup code is identical to the source code from the Implementing order-independent transparency recipe. What we want to do differently here is create two helper functions for ImGui rendering. These lambdas push and pop ImGui flags and styles based on the provided Boolean value so that parts of our UI can be enabled and disabled:

      auto imGuiPushFlagsAndStyles = [](bool value) {

        ImGui::PushItemFlag(      ImGuiItemFlags_Disabled, !value);

        ImGui::PushStyleVar(ImGuiStyleVar_Alpha,      ImGui::GetStyle().Alpha * (value ? 1.0f : 0.2f));

      };

      auto imGuiPopFlagsAndStyles = []() {

        ImGui::PopItemFlag();

        ImGui::PopStyleVar();

      };

  9. To render the shadow map in shades of gray, let's use the OpenGL texture swizzling functionality. This way, a single value from an R8 texture is duplicated into three RGB channels. Compare the shadow map's output in this example to the one from the Implementing shadow maps in OpenGL recipe of Chapter 8, Image-Based Techniques:

      GLFramebuffer shadowMap(    8192, 8192, GL_R8, GL_DEPTH_COMPONENT24);

      const GLint swizzleMask[] =    { GL_RED, GL_RED, GL_RED, GL_ONE };

      glTextureParameteriv(    shadowMap.getTextureColor().getHandle(),    GL_TEXTURE_SWIZZLE_RGBA, swizzleMask);

      glTextureParameteriv(    shadowMap.getTextureDepth().getHandle(),    GL_TEXTURE_SWIZZLE_RGBA, swizzleMask);

  10. Inside the main loop, we will use the FPS counter we implemented in the Adding a frames-per-second counter recipe of Chapter 4, Adding User Interaction and Productivity Tools:

      FramesPerSecondCounter fpsCounter(0.5f);

      while (!glfwWindowShouldClose(app.getWindow())) {

        fpsCounter.tick(app.getDeltaSeconds());

        if (sceneData.uploadLoadedTextures())

          mesh.updateMaterialsBuffer(sceneData);

        ...

  11. The following snippet renders a nice ImGui user interface for our application:

        const float indentSize = 16.0f;

        ImGuiIO& io = ImGui::GetIO();

        io.DisplaySize =      ImVec2((float)width, (float)height);

        ImGui::NewFrame();

        ImGui::Begin("Control", nullptr);

        ImGui::Text("Transparency:");

        ImGui::Indent(indentSize);

        ImGui::Checkbox("Opaque meshes", &g_DrawOpaque);

        ImGui::Checkbox("Transparent meshes",      &g_DrawTransparent);

        ImGui::Unindent(indentSize);

        ImGui::Separator();

        ImGui::Text("GPU culling:");

        ImGui::Indent(indentSize);

        ImGui::Checkbox("Enable GPU culling",      &g_EnableGPUCulling);

        imGuiPushFlagsAndStyles(g_EnableGPUCulling);

        ImGui::Checkbox("Freeze culling frustum (P)",      &g_FreezeCullingView);

        ImGui::Text("Visible meshes: %i",      *numVisibleMeshesPtr);

        imGuiPopFlagsAndStyles();

        ImGui::Unindent(indentSize);

        ImGui::Separator();

        ImGui::Text("SSAO:");

        ImGui::Indent(indentSize);

        ImGui::Checkbox("Enable SSAO", &g_EnableSSAO);

        imGuiPushFlagsAndStyles(g_EnableSSAO);

        ImGui::Checkbox("Enable SSAO blur",      &g_EnableBlur);

        ImGui::SliderFloat("SSAO scale",      &g_SSAOParams.scale_, 0.0f, 2.0f);

        ImGui::SliderFloat("SSAO bias",      &g_SSAOParams.bias_, 0.0f, 0.3f);

        ImGui::SliderFloat("SSAO radius",      &g_SSAOParams.radius, 0.02f, 0.2f);

        ImGui::SliderFloat("SSAO attenuation scale",      &g_SSAOParams.attScale, 0.5f, 1.5f);

        ImGui::SliderFloat("SSAO distance scale",      &g_SSAOParams.distScale, 0.0f, 1.0f);

        imGuiPopFlagsAndStyles();

        ImGui::Unindent(indentSize);

        ImGui::Separator();

        ImGui::Text("HDR:");

        ImGui::Indent(indentSize);

        ImGui::Checkbox("Enable HDR", &g_EnableHDR);

        imGuiPushFlagsAndStyles(g_EnableHDR);

        ImGui::SliderFloat("Exposure",      &g_HDRParams.exposure_, 0.1f, 2.0f);

        ImGui::SliderFloat("Max white",      &g_HDRParams.maxWhite_, 0.5f, 2.0f);

        ImGui::SliderFloat("Bloom strength",      &g_HDRParams.bloomStrength_, 0.0f, 2.0f);

        ImGui::SliderFloat("Adaptation speed",      &g_HDRParams.adaptationSpeed_, 0.01f, 0.5f);

        imGuiPopFlagsAndStyles();

        ImGui::Unindent(indentSize);

       ImGui::Separator();

        ImGui::Text("Shadows:");

        ImGui::Indent(indentSize);

        ImGui::Checkbox("Enable shadows",       &g_EnableShadows);

        imGuiPushFlagsAndStyles(g_EnableShadows);

        ImGui::Checkbox("Show light's frustum (red) and scene AABB (white)",      &g_ShowLightFrustum);

        ImGui::SliderFloat("Light Theta",      &g_LightTheta, -85.0f, +85.0f);

        ImGui::SliderFloat("Light Phi",      &g_LightPhi, -85.0f, +85.0f);

        imGuiPopFlagsAndStyles();

        ImGui::Unindent(indentSize);

        ImGui::Separator();

        ImGui::Checkbox("Grid", &g_DrawGrid);

        ImGui::Checkbox("Bounding boxes (all)", &g_DrawBoxes);

        ImGui::End();

  12. Render the debug windows with the SSAO results and the shadow map:

        if (g_EnableSSAO)

          imguiTextureWindowGL("SSAO", ssao.getTextureColor().getHandle());

        if (g_EnableShadows)

          imguiTextureWindowGL("Shadow Map", shadowMap.getTextureDepth().getHandle());

        ImGui::Render();

        rendererUI.render(width, height, ImGui::GetDrawData());

        ...

        app.swapBuffers();

      }

      glUnmapNamedBuffer(numVisibleMeshesBuffer.getHandle());

      glDeleteTextures(1, &luminance1x1);

      return 0;

    }

The demo application should render the Lumberyard Bistro scene like so:

Figure 10.7 – The final OpenGL demo

Try running the demo and playing with the different UI controls.

And with that, we can conclude our OpenGL examples. By the way, we made a similar Vulkan demo in the Chapter10/VK02_Final project.

There's more…

You can use this framework as a starting point for experimenting with more advanced rendering techniques. Adding more complex screen-space effects, such as temporal antialiasing or screen-space reflections, should be relatively easy. Integrating texture compression into the final demo is another interesting exercise. Multiple light sources can be supported by storing their parameters in a buffer and iterating over it in a fragment shader. Once the number of lights grows, further optimizations become possible, such as tiled deferred shading or clustered shading: http://www.cse.chalmers.se/~uffe/clustered_shading_preprint.pdf.
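As a rough illustration of the multiple-lights idea, here is a minimal CPU-side sketch of how the light parameters could be packed before uploading them to a buffer. It is not part of the demo: the GPULight structure, its field layout, and the makeLightRing() and lightBufferSizeInBytes() helpers are all hypothetical names introduced for this example. The only assumption about the GPU side is that the fragment shader reads the array through a std430 shader storage buffer, so each record is padded to two vec4-sized members.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-light record (not taken from the demo's code).
// Two vec4-sized members keep the CPU layout identical to a GLSL
// std430 struct { vec4 position; vec4 color; }.
struct alignas(16) GPULight {
    float position[4]; // xyz = world-space position, w = radius of influence
    float color[4];    // rgb = color, a = intensity
};

static_assert(sizeof(GPULight) == 32, "must match the GLSL struct size");
static_assert(offsetof(GPULight, color) == 16, "color must start at a vec4 boundary");

// Illustrative helper: places `count` white point lights on a horizontal ring.
std::vector<GPULight> makeLightRing(std::uint32_t count, float ringRadius, float height)
{
    assert(ringRadius > 0.0f);
    std::vector<GPULight> lights;
    lights.reserve(count);
    const float twoPi = 6.28318530718f;
    for (std::uint32_t i = 0; i < count; ++i) {
        const float angle = twoPi * float(i) / float(count);
        GPULight l{};
        l.position[0] = ringRadius * std::cos(angle);
        l.position[1] = height;
        l.position[2] = ringRadius * std::sin(angle);
        l.position[3] = 2.0f * ringRadius; // arbitrary radius of influence
        l.color[0] = l.color[1] = l.color[2] = 1.0f;
        l.color[3] = 1.0f;                 // arbitrary intensity
        lights.push_back(l);
    }
    return lights;
}

// Number of bytes to allocate for the buffer that holds the light array.
std::size_t lightBufferSizeInBytes(const std::vector<GPULight>& lights)
{
    return lights.size() * sizeof(GPULight);
}

// One possible upload path, assuming an OpenGL 4.6 context and a free binding point:
//   GLuint buf; glCreateBuffers(1, &buf);
//   glNamedBufferStorage(buf, lightBufferSizeInBytes(lights), lights.data(), 0);
//   glBindBufferBase(GL_SHADER_STORAGE_BUFFER, /* binding point of your choice */ 0, buf);
// On the GLSL side, the fragment shader could declare
//   layout(std430, binding = 0) readonly buffer Lights { Light lights[]; };
// and loop from 0 to lights.length(), accumulating each light's contribution.
```

Keeping every member vec4-sized wastes a few bytes per light but avoids the alignment surprises that vec3 members cause in std430 and std140 layouts. The binding point used in the comments is a placeholder and has to be chosen so that it does not clash with the buffers the demo already binds.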

The easiest way to accommodate shadows from multiple light sources in this demo would be to use separate shadow maps and access them via bindless textures. Rendering more complex materials is another broad direction to explore. A good starting point could be converting the Bistro scene materials into PBR and using the glTF2 rendering code from Chapter 6, Physically-Based Rendering Using the glTF2 Shading Model, to render them properly. If you decide to go in this direction and want to learn more about the practical side of PBR, the documentation for the Filament engine is an excellent resource: https://google.github.io/filament/Filament.html.
