JavaScript code is executed on the central processing unit of the computer (the CPU). This computing unit is fast and can do complex operations, but they are always processed sequentially, which makes a major bottleneck. You can use WebWorkers to exploit all the CPUs, but there are almost always less than 16 CPUs on current computers. Furthermore, JavaScript is still an interpreted language, which is not as optimized as compiled languages like C or C++.
Note: All of the source code used in this book can be found here: https://github.com/backstopmedia/deep-learning-browser. And, you can access the demo of our Rock Paper Scissors game here: https://reiinakano.github.io/tfjs-rock-paper-scissors/. Also, you can access the demo of our text generation model here: https://reiinakano.github.io/tfjs-lstm-text-generation/.
The best support for deep learning computations is the GPU. Modern GPUs have hundreds of processing units, even on mobile devices. They can all work together in parallel, usually to compute pixel colors. Fortunately deep learning computations can be heavily parallelized, because each unit of a neuron layer is independent from the other ones of the same layer.
A WebGL deep learning implementation is mandatory to deal with a video stream in real time. The neural network should process each frame of the video, which is an image having many pixels, represented by an input vector in a high dimensional space. This would be too demanding for the CPU. Furthermore, it is easy and efficient to convert a <video>
element to a WebGL texture; a storage in the GPU memory.
WebGL is an OpenGL ES binding for JavaScript. For now, it is the only way to exploit the Graphic Processing Unit (GPU) and the massive parallelization capabilities at a low level to speed up the deep learning computation in the browser. With WebGL you can use 100% of the capabilities of the GPU. GPU instructions (the drawcalls) are the same whether they are sent with WebGL or in a native application (like C) using OpenGL directly.
In this chapter, we first study how WebGL works in order to draw two triangles forming a quad filled with a color gradient. By adding a few lines of graphic code this simple color gradient is metamorphosed into a beautiful colored Mandelbrot fractal. You then will learn to understand the power of WebGL and where most computations will be performed. You will improve the code to build our first GPU simulation, which is the Conway game of life. We tackle precision and optimization problems to go from discrete to continuous simulations with a thermal diffusion demonstration. These two simple applications will help you understand the efficiency of WebGL programs as well introduce you to the WebGL programming model and language.
In the second half of this chapter, we come back to deep learning to explain and implement specific shaders for common matrices operations. These fast matrix operations (such as convolution, pooling, activations, etc.) are the foundation of all deep learning frameworks. We build our own GPGPU linear algebra library called WGLMatrix. We use it to make and train a neuron network which recognizes handwritten digits using the MNIST dataset, which is the hello world application for image classification. Finally, we optimize the learning script to make it five times faster than the Python/Numpy equivalent CPU implementation.
Credit: Jeeliz Sunglasses, jeeliz.com/sunglasses
WebGL deep learning is fast enough to process video stream in real time. This figure is a sunglasses virtual try-on application using the user’s webcam. A convolutional neural network detects the face, its orientation, its rotation, and even the lighting. Then a glasses 3D model is drawn using these informations.
WebGL is not a 3D library. Although there are some tools to facilitate the implementation of 3D rendering algorithms, the whole process of projection using movement and projection matrices needs to be developed with plain matrix operations. WebGL is a rasterization library: it turns vector objects into discrete pixel values.
In this figure on the left: a vectorized cat. Each point is a vector encoding the position from the origin. Such an image can be scaled indefinitely, but cannot be displayed directly on the screen. On the right: a rasterized image formed by pixels.
As we already learned in the introduction, WebGL allows you to run heavily parallelized code by exploiting the rendering pipeline. However, unlike CUDA or OpenCL, WebGL unfortunately is not a general purpose computation pipeline for running parallelized code. Therefore, to benefit from the parallel GPU architecture, we need to transform all deep learning relevant operations into WebGL rendering pipelines. This should give you enough motivation to follow this section with enough attention to understand the basic principles of the rendering pipeline, shaders, and GLSL.
WebGL is standardized by the Khronos Group, like OpenGL. Its specifications are on https://www.khronos.org/registry/webgl/specs. WebGL output is rendered within the <canvas>
element. So, the first step to play with WebGL is to insert a <canvas>
element in the HTML code of the webpage:
<body>
<canvas
id=
'myWebGLCanvas'
height=
'512'
width=
'512'
>
</canvas>
</body>
After the page loads, get the canvas from the DOM in the JavaScript code and create the WebGL context GL
. At this point check if the configuration of the user is compatible with WebGL:
var
myCanvas
=
document
.
getElementById
(
'myWebGLCanvas'
);
var
GL
;
try
{
GL
=
myCanvas
.
getContext
(
'webgl'
,
{
antialias
:
false
,
depth
:
false
}
);
}
catch
(
e
)
{
alert
(
'Cannot init a WebGL context. So sad...:('
);
}
We have disabled antialiasing and depth buffer because we are going to use WebGL for computing instead of rendering. After this operation, the <canvas>
HTML element is fully dedicated to WebGL. It will not be possible to use it for canvas2D rendering or to unbind the WebGL context. In the following samples, the GL
variable is the point of entry of the WebGL API: all WebGL functions and attributes are methods and properties of the context.
From a WebGL point of view, the drawing area is called the viewport. Its coordinate system always has the origin in the center, so the X axis goes from -1
(left) to 1
(right), and the Y axis goes from -1
(bottom) to 1
(top).
The WebGL workflow is split between:
JavaScript is unable to parse GLSL natively, so the GLSL source code and variable names are always declared as strings and then passed to the WebGL context. This looks like a big drawback at first, however when working with Typescript, the usage of template strings will greatly improve the readability of GLSL shaders. However, for this book we will stick with JavaScript.
WebGL provides access to two kinds of shaders:
For example, if a 3D cube is drawn with WebGL:
A couple consisting of a shader of each type is called a shader program. It fully characterizes a specific rendering (i.e. a material for 3D rendering). Both shaders have the same structure:
//declaration of the I/O variables
//custom functions
//main loop
void
main
(
void
){
//main code here
ouput_variable
=
value
;
}
The output variable of the vertex shader is always gl_Position
. It is a four dimensional vector [x,y,z,w]
in clipping coordinates. The position of the point in 2D in the viewport is [x/w, y/w]
. z/w
is the depth buffer value, used in 3D rendering to manage depth overlapping. w
is the extension of 3D coordinates to 4D homogeneous coordinates, a coordinate system that facilitates affine transformations (rotation, scaling and translation). Using homogeneous coordinates an affine transformation can be written in a single matrix multiplication.
The output variable of the fragment shader is gl_FragColor
or gl_FragData
depending on the GLSL version. It is the RGBA color value of the pixel, where A
represents the alpha channel and hence manages transparency. For standard color rendering, each channel value is clamped between 0 and 1 (for example [1,1,1,1]
is opaque white, [0.5,0,0,1]
is 50% dark red and [1,0,0,1]
is opaque black).
This figure is a simplified diagram of a WebGL workflow (we have deliberately forgotten the varying GLSL I/O type because they may not be used for GPGPU). On the left, there is the data: per vertex data for the VBOs, typed arrays for the textures or JavaScript numbers. In the middle there are the GLSL I/O variables: they are the bridge between JavaScript and GLSL. They are a kind of pointer in JavaScript, and they can be directly used in the shaders. On the left there is the GPU code. The rasterization process between the shaders is automatically done by the graphic drivers.
We use the main computations in the fragment shader. We fill the viewport with two triangles and we do all of the rendering in the fragment shader. This is common practice when using the WebGL shaders for computing. Using these two triangles will make sure that we can later execute one fragment shader per pixel. If the triangles don’t cover the whole viewport, the fragment shaders are only executed for the pixels covered by the triangle.
The WebGL workflow is still quite stiff and we cannot use the shaders separately. So we have to use the vertex shader too, but only to fill the viewport with two triangles. Then the color of each pixel of the triangles is assigned in the fragment shader. Later we will replace rendering by computing. We refer to the first example of the book code repository: chapter3/0_webglFirstRendering.
We declare a JavaScript typed array containing the 2D coordinates of the four corners of the viewport:
var
quadVertices
=
new
Float32Array
([
-
1
,
-
1
,
//bottom left corner -> index 0
-
1
,
1
,
//top left corner -> index 1
1
,
1
,
//top right corner -> index 2
1
,
-
1
//bottom right corner-> index 3
]);
Then, group these four points to build two non overlapping triangles using the indices of the vertices:
var
quadIndices
=
new
Uint16Array
([
0
,
1
,
2
,
//first triangle if made with points of indices 0,1,2
0
,
2
,
3
//second triangle
]);
In this figure the two triangles fill the viewport. The top left triangle is displayed in blue. Its point indices are 0,1,2. The red bottom right triangle, which point indices, are 0,2,3.
The data is stored in JavaScript typed arrays. We send them to the GPU memory (the VRAM) by creating Vertex Buffer Object (VBO). A VBO is simply an array stored in the GPU memory:
//send vertices to the GPU:
var
quadVerticesVBO
=
GL
.
createBuffer
();
GL
.
bindBuffer
(
GL
.
ARRAY_BUFFER
,
quadVerticesVBO
);
GL
.
bufferData
(
GL
.
ARRAY_BUFFER
,
quadVerticesVBO
,
GL
.
STATIC_DRAW
);
//send indices to the GPU:
var
quadIndicesVBO
=
GL
.
createBuffer
();
GL
.
bindBuffer
(
GL
.
ELEMENT_ARRAY_BUFFER
,
quadIndicesVBO
);
GL
.
bufferData
(
GL
.
ELEMENT_ARRAY_BUFFER
,
quadIndicesVBO
,
GL
.
STATIC_DRAW
);
The vertex shader takes the quad vertices as input, and outputs them at the corners of the viewport. Here is the source code of the vertex shader in GLSL:
attribute
vec2
position
;
void
main
(
void
){
gl_Position
=
vec4
(
position
,
0.
,
1.
);
//position in clip coords
}
As you can see in the above example, we only need the x
and y
coordinates of the vertices. Hence we set the depth z=0
and w=1
.
Then we write a fragment shader, which outputs the 2D position of each pixel in the viewport in the red and green color channels. From now on, we will often use this trick to encode data in the color channels of the output pixels.
precision
highp
float
;
uniform
vec2
resolution
;
//resolution in pixels
void
main
(
void
){
//gl_FragCoord is a built-in input variable.
//It is the current pixel position, in pixels:
vec2
pixelPosition
=
gl_FragCoord
.
xy
/
resolution
;
gl_FragColor
=
vec4
(
pixelPosition
,
0.
,
1.
);
}
We declare both shaders as JavaScript strings and we compile them separately:
//declare shader sources as string
var
shaderVertexSource
=
"attribute vec2 position; "
+
"void main(void){ "
+
"gl_Position=vec4(position, 0., 1.); "
+
"}"
;
var
shaderFragmentSource
=
"precision highp float; "
+
"uniform vec2 resolution; "
+
"void main(void){ "
+
"vec2 pixelPosition=gl_FragCoord.xy/resolution; "
+
"gl_FragColor=vec4(pixelPosition, 0.,1.); "
+
"}"
;
//function to compile a shader
function
compile_shader
(
source
,
type
,
typeString
)
{
var
shader
=
GL
.
createShader
(
type
);
GL
.
shaderSource
(
shader
,
source
);
GL
.
compileShader
(
shader
);
if
(
!
GL
.
getShaderParameter
(
shader
,
GL
.
COMPILE_STATUS
))
{
alert
(
"ERROR IN "
+
typeString
+
" SHADER: "
+
GL
.
getShaderInfoLog
(
shader
));
return
false
;
}
return
shader
;
};
//compile both shaders separately
var
shaderVertex
=
compile_shader
(
shaderVertexSource
,
GL
.
VERTEX_SHADER
,
"VERTEX"
);
var
shaderFragment
=
compile_shader
(
shaderFragmentSource
,
GL
.
FRAGMENT_SHADER
,
"FRAGMENT"
);
We create the shader program, which contains all of the graphic code for a specific rendering:
var
shaderProgram
=
GL
.
createProgram
();
GL
.
attachShader
(
shaderProgram
,
shaderVertex
);
GL
.
attachShader
(
shaderProgram
,
shaderFragment
);
Finally, we link GLSL I/O variables with JavaScript pointers-like variables, which will be used later to update GLSL values from JavaScript. GLSL variable names are always given as string because JavaScript cannot understand GLSL:
//start the linking stage:
GL
.
linkProgram
(
shaderProgram
);
//link attributes:
var
posAttribPointer
=
GL
.
getAttribLocation
(
shaderProgram
,
"position"
);
GL
.
enableVertexAttribArray
(
posAttribPointer
);
//link uniforms:
var
resUniform
=
GL
.
getUniformLocation
(
shaderProgram
,
"resolution"
);
The only VBO used in GPGPU is the quad, so we can bind it to the context once and for all:
GL
.
bindBuffer
(
GL
.
ARRAY_BUFFER
,
quadVerticesVBO
);
GL
.
vertexAttribPointer
(
posAttribPointer
,
2
,
GL
.
FLOAT
,
false
,
8
,
0
);
GL
.
bindBuffer
(
GL
.
ELEMENT_ARRAY_BUFFER
,
quadIndicesVBO
);
The drawcall GL.vertexAttribPointer
details how to parse shader program attributes from VBO data. It means that the posAttribPointer
attribute (previously linked to GLSL "position"
variable) has two components that type is GL.FLOAT
. false
disables on a fly vertex normalization, 8
is the size of 1 vertice in bytes (8 bytes = 2 components * 4 bytes because a GL.FLOAT
element is stored using 4 bytes).
Let’s trigger the rendering:
GL
.
useProgram
(
shaderProgram
);
//update GLSL "resolution" value in the fragment shader:
GL
.
viewport
(
0
,
0
,
myCanvas
.
width
,
myCanvas
.
height
);
//update GLSL "resolution" value in the fragment shader:
GL
.
uniform2f
(
resUniform
,
myCanvas
.
width
,
myCanvas
.
height
);
//trigger the rendering:
GL
.
drawElements
(
GL
.
TRIANGLES
,
6
,
GL
.
UNSIGNED_SHORT
,
0
);
GL
.
flush
();
Each pixel color value is assigned in parallel by the fragment shader statement gl_FragColor=vec4(pixelPosition, 0.,1.)
. Each pixel only knows its relative position given by built-in variable gl_FragCoord
. This is the power of WebGL parallelization. The resulting pixel array materialized by the rendering is analogous to a CUDA kernel.
Replace the main
function of the fragment shader by:
void
main
(
void
){
vec2
pixPos
=
gl_FragCoord
.
xy
/
resolution
;
//translate and scale:
vec2
pixPosCentered
=
1.3
*
(
pixPos
*
2.
-
vec2
(
1.55
,
1.
));
vec2
z
=
pixPosCentered
,
newZ
;
float
j
=
0.
;
for
(
int
i
=
0
;
i
<=
200
;
i
+=
1
)
{
newZ
=
pixPosCentered
+
vec2
(
z
.
x
*
z
.
x
-
z
.
y
*
z
.
y
,
2.
*
z
.
y
*
z
.
x
);
if
(
length
(
newZ
)
>
2.
)
break
;
z
=
newZ
;
j
+=
1.
;
}
//generate RGB color from j:
vec3
color
=
step
(
j
,
199.
)
*
vec3
(
j
/
20.
,
j
*
j
/
4000.
,
0.
);
gl_FragColor
=
vec4
(
color
,
1.
);
}
We get this beautiful Mandelbrot fractal so fastly that it could be done inside a rendering loop running smoothly at 60 frames per second. You can try it on the github repository, chapter3/1_mandelBrot.
Explanations about this fractal can be found on nuclear.mutantstargoat.com. Many interesting renderings can be obtained by using only the fragment shader, including real-time raytracing implemented with the raymarching algorithm. They can be found on shadertoy.com.
Instead of drawing beautiful color gradients or fractals using the fragment shader, we will process computations. In this part, the main principles of general purpose computing with WebGL (GPGPU) are explained.
We can perform computations in parallel in two different places with WebGL: in the fragment shader or in the vertex shader. It is not convenient to process the main workload in the vertex shader because its output gl_Position
can’t be read directly or saved into a texture or another object like gl_FragColor
. It controls the position, so if some computation results have the same position in the viewport, they will overlap, and if they fall outside of the viewport they won’t be readable. Moreover, we usually deal with less vertices than output pixels and hence parallelization in the fragment shader will be much more efficient.
Furthermore, we would still end up reading or saving this result by using the output of the fragment shader. So for GPGPU we always use the vertex shader to simply fill the viewport and do the whole computational part in the fragment shader.
Instead of rendering on the HTML canvas and to the screen, we render to a so called framebuffer, which is a location within the GPU memory. In the next drawcall, use this framebuffer as a texture and read its values from another fragment shader. Although a texture is usually seen as an image stored in the VRAM, apprehend it as a 2D array with four channels per value (which are the RGBA color channels). It stores the result of a calculation step and the 4 channels are assigned according to an arbitrary convention.
This is a kind of old school GPGPU, because there was the release of GPU specific computation libraries or APIs like OpenCL or CUDA.
Popular JavaScript libraries for deep learning in the browser (such as Tensorflow.js) map whole deep learning models into the VRAM. Each layer’s output is written into a texture and fed into the next layer where it is looked up in the next fragment shader. We want to avoid reading the GPU memory from the main JavaScript thread at all costs, but rather build the complete execution graph of the deep learning model as a giant render path.
WebGL can be hard to debug because the code is executed on different supports and there are important variations between graphic hardware. Some browser extensions like WebGL inspector or WebGL Insight can help to inspect the graphic memory linked to a specific WebGL context. It is possible to browse and inspect the textures, the VBOs, or to list the drawcalls.
If you face a hardware dependant bug, you can:
If WebGL does not work anymore on a specific configuration, the graphic drivers may have been blacklisted by the browser because a security hole has been discovered. Their update should fix the problem.
GLSL syntax errors are detected when shaders are compiled. They are explicit and the line number (in the GLSL code) is specified so they are easy to fix. When working with shaders, this error is most common.
Most often these errors are reported as warnings in the standard JavaScript console. They can occur, for example, if a custom framebuffer is not bound to a texture, or if a texture is initialized from a typed array which has a different length than expected. The error message is explicit, but the JavaScript line number where there is the drawcall triggering the error is not specified.
You can set a breakpoint and analyze the error in the JavaScript console. We advise to add the GL.finish()
statement before the breakpoint in order to be sure that the GPU has executed all its pending drawcalls.
These errors are hard to analyze because it is impossible to set a breakpoint in the shaders. A specific rendering needs to be implemented to highlight the problem by displaying the simulation variables in the color channels of the standard framebuffer.
We develop a Conway game of life to illustrate the concept of render to texture for computing. WebGL and full JavaScript implementations of this simulation are included in the code repository of this book. The WebGL program is about 4000 times faster running on low level graphic hardware and a high end CPU. It is a wonderful example to illustrate the effects of parallelization on a relatively simple code. The source code is available on the Github repository, chapter3/2_renderToTexture.
We first declare the global parameters of the simulation:
var
SETTINGS
=
{
simuSize
:
256
,
//simulation is done in a 256*256 cells
//square
nIterations
:
2000
//number of computing iterations
};
A default framebuffer object (FBO) is created and bound to the context when the WebGL context is instantiated. The rendering occurs in this framebuffer which controls the display. To render to texture (RTT), we need to create a custom framebuffer object. Then we bind it to the context:
var
rttFbo
=
GL
.
createFramebuffer
();
GL
.
bindFramebuffer
(
GL
.
FRAMEBUFFER
,
rttFbo
);
This function creates a texture from a JavaScript typed array:
function
create_rttTexture
(
width
,
height
,
data
){
var
texture
=
GL
.
createTexture
();
GL
.
bindTexture
(
GL
.
TEXTURE_2D
,
texture
);
//texture filtering:
//pick the nearest pixel from the texture UV coordinates
//(do not linearly interpolate texel values):
GL
.
texParameteri
(
GL
.
TEXTURE_2D
,
GL
.
TEXTURE_MAG_FILTER
,
GL
.
NEAREST
);
GL
.
texParameteri
(
GL
.
TEXTURE_2D
,
GL
.
TEXTURE_MIN_FILTER
,
GL
.
NEAREST
);
//do not repeat texture along both axis:
GL
.
texParameteri
(
GL
.
TEXTURE_2D
,
GL
.
TEXTURE_WRAP_S
,
GL
.
CLAMP_TO_EDGE
);
GL
.
texParameteri
(
GL
.
TEXTURE_2D
,
GL
.
TEXTURE_WRAP_T
,
GL
.
CLAMP_TO_EDGE
);
//set size and send data to the texture:
GL
.
texImage2D
(
GL
.
TEXTURE_2D
,
0
,
GL
.
RGBA
,
width
,
height
,
0
,
GL
.
RGBA
,
GL
.
UNSIGNED_BYTE
,
data
);
return
texture
;
}
It is not possible to simultaneously read a texture and render to it. So we need to create two textures. Both store the cell states in the red channel only (the cell is alive if the red channel value is 1.0
and dead if it is equal to 0.0
):
var
dataTextures
=
[
create_rttTexture
(
SETTINGS
.
simuSize
,
SETTINGS
.
simuSize
,
data0
),
create_rttTexture
(
SETTINGS
.
simuSize
,
SETTINGS
.
simuSize
,
data0
)
];
data0
is a randomly initialized Uint8Array
storing the RGBA values of the texture.
During the first simulation iteration, we render to dataTextures[1]
using dataTextures[0]
, and then we swap the two textures and reiterate.
We need two shader programs:
Both shader programs have the same vertex shader, still drawing two triangles to fill the viewport. All the logic is done by the fragment shader.
This is the main function of the computing fragment shader. It implements the logic of the Conway game of life:
void
main
(
void
){
//current position of the rendered pixel:
vec2
uv
=
gl_FragCoord
.
xy
/
resolution
;
vec2
duv
=
1.
/
resolution
;
//distance between 2 texels
//cellState values: 1->alive, 0->dead:
float
cellState
=
texture2D
(
samplerTexture
,
uv
).
r
;
//number of alive neighbors (Moore neighborhood):
float
nNeighborsAlive
=
texture2D
(
samplerTexture
,
uv
+
duv
*
vec2
(
1.
,
1.
)).
r
+
texture2D
(
samplerTexture
,
uv
+
duv
*
vec2
(
0.
,
1.
)).
r
+
texture2D
(
samplerTexture
,
uv
+
duv
*
vec2
(
-
1.
,
1.
)).
r
+
texture2D
(
samplerTexture
,
uv
+
duv
*
vec2
(
-
1.
,
0.
)).
r
+
texture2D
(
samplerTexture
,
uv
+
duv
*
vec2
(
-
1.
,
-
1.
)).
r
+
texture2D
(
samplerTexture
,
uv
+
duv
*
vec2
(
0.
,
-
1.
)).
r
+
texture2D
(
samplerTexture
,
uv
+
duv
*
vec2
(
1.
,
-
1.
)).
r
+
texture2D
(
samplerTexture
,
uv
+
duv
*
vec2
(
1.
,
0.
)).
r
;
if
(
nNeighborsAlive
==
3.0
){
cellState
=
1.0
;
//born
}
else
if
(
nNeighborsAlive
<=
1.0
||
nNeighborsAlive
>=
4.0
){
cellState
=
0.0
;
//die
};
gl_FragColor
=
vec4
(
cellState
,
0.
,
0.
,
1.
);
}
The texture2D
statement fetches a pixel from a texture (also called a texel). Its arguments are the texture sampler and the texel coordinates. We can use around 16 textures simultaneously in one fragment shader, depending on the GPU capabilities. The exact number can be obtained by the statement GL.getParameter(GL.MAX_TEXTURE_IMAGE_UNITS)
. Of course you can instantiate many more textures but you won’t be able to use them all simultaneously. A sampler has the type uniform sampler2D
in the shaders but is assigned like an integer from JavaScript:
//At the linking step
var
_samplerTextureRenderingUniform
=
GL
.
getUniformLocation
(
shaderProgramRendering
,
'samplerTexture'
);
//...
//We affect the sampler value to channel 7, like an integer
GL
.
useProgram
(
myShaderProgram
);
GL
.
uniform1i
(
_samplerTextureRenderingUniform
,
7
);
//...
//Just before the rendering:
//we activate the texture channel 7:
GL
.
activeTexture
(
GL
.
TEXTURE7
);
//we bind myTexture to the activated channel:
GL
.
bindTexture
(
GL
.
TEXTURE_2D
,
myTexture
);
The second parameter of the GLSL statement texture2D
are the textures coordinates, also called UV coordinates. It is a vec2
instance, which locates the fetched texel. Its components are both between 0.0
and 1.0
.
Let’s prepare the simulation step:
GL
.
useProgram
(
shaderProgramComputing
);
GL
.
viewport
(
0
,
0
,
SETTINGS
.
simuSize
,
SETTINGS
.
simuSize
);
And launch the simulation loop:
for
(
var
i
=
0
;
i
<
SETTINGS
.
nIterations
;
++
i
){
//dataTextures[0] is the state (read):
GL
.
bindTexture
(
GL
.
TEXTURE_2D
,
dataTextures
[
0
]);
//dataTextures[1] is the updated state (written):
GL
.
framebufferTexture2D
(
GL
.
FRAMEBUFFER
,
GL
.
COLOR_ATTACHMENT0
,
GL
.
TEXTURE_2D
,
dataTextures
[
1
],
0
);
GL
.
drawElements
(
GL
.
TRIANGLES
,
6
,
GL
.
UNSIGNED_SHORT
,
0
);
dataTextures
.
reverse
();
}
The instruction GL.framebufferTexture2D(...)
means that everything that is drawn on the current bound framebuffer (which is rttFbo
) is also drawn into the texture dataTextures[1]
.
Finish by the rendering step:
//come back to the default FBO (displayed on the canvas):
GL
.
bindFramebuffer
(
GL
.
FRAMEBUFFER
,
null
);
GL
.
useProgram
(
shaderProgramRendering
);
GL
.
viewport
(
0
,
0
,
myCanvas
.
width
,
myCanvas
.
height
);
//[...]
//trigger the rendering:
GL
.
drawElements
(
GL
.
TRIANGLES
,
6
,
GL
.
UNSIGNED_SHORT
,
0
);
GL
.
flush
();
This figure is a Conway game of life simulation. We have changed the cell initialization values from the repository code to start with a square of living cells at the center of the simulation area (left image). After 2000 iterations, the complexity emerges (right image).
We only use discrete values in the Conway game of life: the cell is either alive or dead. But if you need to use continuous values, you would be limited by the WebGL default eight bits precision. Indeed, each component of gl_FragColor
is encoded using eight bits, and is clamped between 0
and 1
. There are only 2^8 = 256
possible values for each component. This is far enough to encode a color because human eye is not accurate enough to distinguish between two colors differing at one bit. But we need higher precision for deep learning models because we usually deal with floating-point values. 16 bits (GL.HALF_FLOAT
) would be enough, but on some configurations only 32 bits precision (GL.FLOAT
) is available. These capabilities are required:
FLOAT
or HALF_FLOAT
texturesFLOAT
or HALF_FLOAT
texturesIf you use WebGL1, you need to enable OES_TEXTURE_FLOAT
or OES_TEXTURE_HALF_FLOAT
extensions (which are not always implemented). Then you still need to test render to FLOAT
or HALF_FLOAT
textures.
If WebGL2 is used, these requirements are met because FLOAT
and HALF_FLOAT
have been included in the specifications. But only render to texture to HALF_FLOAT
texture is specified. WebGL2 is backward compatible with WebGL1 and can be used by:
var
GL
=
myCanvas
.
getContext
(
'webgl2'
,
...);
At the initialization of the context, you should:
HALF_FLOAT
precisionIf WebGL1 is available:
OES_TEXTURE_FLOAT
extensionFLOAT
texturesOES_TEXTURE_FLOAT
extension is not available or if RTT does not work:OES_TEXTURE_HALF_FLOAT
extensionOn some GPU configurations, we also have to get the <WEBGL|EXT|OES>_color_buffer_float
extension. Otherwise it is not possible to bind a framebuffer object to a FLOAT
or HALF_FLOAT
texture.
The precision used in a shader is specified in the first line by:
precision
highp
float
It accepts three values:
lowp
: computations are done with eight bit precision. It is fast, but this is not precise enough for floating-point number calculations. It still suits for rendering color values.mediump
: highp
, lowp
or another precision between the two is used depending on the GPU configuration. It is important to check the real precision using GL.getShaderPrecisionFormat(GL.MEDIUM_FLOAT)
before using it because this level of precision can vary a lot depending on the graphic hardware.highp
: floating-point numbers are manipulated using 16 or 32 bits. 16 bits are enough for deep learning computations (but not always for physical simulations...). The real precision for this level can be checked by running GL.getShaderPrecisionFormat(GL.HIGH_FLOAT)
.The returned value of GL.getShaderPrecisionFormat(<level>)
is an object that the precision
property equals to the number of bits encoding the fraction of a floating-point number in the shaders. For example, for a 32 bits floating-point number it is 23, for a 16 bits floating-point number it is 10.
Start from the Conway game of life to develop a heat simulation, involving render to floating-point textures. You can test it on the Github repository, chapter3/3_RTTfloat. We simulate in 2D a square part of iron, which side measures 2.56 meters heated at 100°C, surrounded by iron at 0°C. The heat diffuses over time.
In this figure, on the left: a simulation in the initial state, and on the right: a simulation after 2000 seconds. Colors are defined by the IDL_Rainbow color map implemented in the fragment shader and calibrated from 0°C to 100°C.
Implementing a deep learning network with WebGL is not easy and it introduces a lot of complexity due to the specific workflow and the different execution paths depending on the graphic hardware. The only goal is to increase the performance, and that would be a shame to undermine this effort by falling into common graphic programming pitfalls.
Futhermore, speed really matters in many cases. Computer vision problems, such as an image classification, segmentation, or object detection, often run on continuous video streams where we have around hundred milliseconds to run at least at 30 Frames Per Second (FPS). Therefore, you should follow a few rules on basic shader optimizations.
As you can imagine, GLSL development can be the subject of an entire book (the most popular ones cover more than 1.000 pages!). We will only deal with most common and penalizing development mistakes. If you want to go further with WebGL and GLSL learning you can try the free interactive tutorials of http://webgl.academy.
When possible, avoid conditional statements like if...then...else
in the shaders. Consider this code computing an exponential linear unit (ELU) activation function:
float
ELU
(
float
x
){
if
(
x
>=
0.0
){
return
x
;
}
else
{
return
exp
(
x
)
-
1.0
;
}
}
It uses an if
statement. This code is more efficient:
float
ELU
(
float
x
){
return
mix
(
exp
(
x
)
-
1.0
,
x
,
step
(
x
,
0.
));
}
The GLSL built-in mix()
function is defined by: mix(x,y,a)=x*(1-a)+y*a
.
Use several small shaders instead of one large shader. If the shader is too long, the GPU execution cache will have to be updated during execution time, and this is penalizing.
In a high precision shader, floating-point numbers are stored using 16 to 32 bits, depending on the GPU. The first bit determines the sign, then for a 32 bits float, 8 bits encodes the exponent and 23 bits the fraction. Because of this storage format, the maximum value of a 32 bits encoded float is 3.4e38
, and the minimum is -3.4e38
. For 16 bits floating-point numbers, this range is narrower.
This figure is a 32 bit standard floating-point number binary encoding. Source: Wikipedia.
If the result of a computation above is this maximum value, it is replaced by a floating-point special, +Infinite
. The handling of floating-point specials like +Infinite
, -Infinite
, NaN
, depends on the GPU hardware. Then this value may propagate along the computation stream.
Floating-point specials can be stored on FLOAT
or HALF_FLOAT
textures. So if we render to texture, these values may propagate over the following calculation passes.
Floating-point specials appear even if the calculation step is implicit. We consider the GLSL function computing the ELU activation function:
float
ELU
(
float
x
){
return
mix
(
exp
(
x
)
-
1.0
,
x
,
step
(
x
,
0.
));
}
If x=100
, which is not unrealistic (x
can be the summed weighted inputs of the neuron), we get ELU(100) = 100
. But the GPU has to compute ELU(100) = mix(exp(100)-1, 100, 1) = (exp(100)-1) * 0 + 100
. But exp(100) = 2.7.10^{43}
is above the maximum floating-point number value. So it is replaced by a floating-point special, +Infinity
. And exp(100)-1 = +Infinity-1 = +Infinity
. The GPU compute ELU(100) = +Infinity * 0 + 100
. But +Infinity * 0
is undefined and it generates another floating-point special, NaN
. NaN
always propagates along the computation flow because any operation involving NaN
outputs NaN
(On the contrary Infinity
can disappear because 1/Infinity = 0
). The GPU outputs ELU(100) = NaN + 100 = NaN
. Then all connected neurons of the next layer receive a NaN
value among their inputs, and will output NaN
too and so on...
There are some solutions to avoid floating-point specials:
mix()
or other GLSL interpolation function is called, make sure that both terms are small enough. This implementation of ELU is safer:float
ELU
(
float
x
){
return
mix
(
exp
(
-
abs
(
x
))
-
1.0
,
x
,
step
(
x
,
0.
));
}
All specifications about floating-point specials on Nvidia GPUs can be found on the Nvidia Website.
A GPU has a hierarchical cache system like a CPU. When a texel is fetched with a texture2D()
statement in the fragment shader, the GPU searches first in the faster, but smaller levels of cache. If it is not here, it triggers a cache miss and searches in the second cache, larger but slower, and so on until it fetches it directly in the highest level of texture cache. Then the texel is copied in a lower level of cache, and will be faster to retrieve for the next lookups.
If neighboring pixels of the rendering reuse the same texels, the cache misses rate is low and the shader runs fast. But if they use many different texels, so many cache misses will occur, and the performance will strongly decrease. In some cases there are several ways to arrange the data in a texture. One of the alternatives is often better optimized for this aspect.
In this figure you see the graphic memory, where texels are not aligned but sorted with the Morton order. This process is called texture swizzling. When a texel is fetched, the whole line of texels (displayed in red) is pushed to a low level of the GPU memory cache. So the 2D neighboring texels will be faster to fetch, because they are already in the low levels of memory cache.
Although there are several publications about texture caches (Texture Caches, Michael Doggett, Lund University), the cache implementation depends a lot on the hardware. It is still necessary to test WebGL programs with a sample of different hardware to make sure that it is fast enough and runs everywhere.
Render to texture is defined for four color channels textures and four color channels framebuffers. They can be used in different ways:
gl_FragColor
.The WebGL state can be modified using GL.enable()
or GL.disable()
instructions. Their effect is retentive so the state of the WebGL context needs to be changed once. Dithering should be disabled when the fragment shader is used for computing, because it may alter the output values. It consists in applying some noise to the fragment shader output to avoid color quantization visual artifacts.
GL
.
disable
(
GL
.
DITHER
);
In this figure you see how dithering may alter color pixel values in the framebuffer if it has lower precision than the computed colors. On the left: a picture of a cat in 32 bit RGBA colors. In the middle: a picture of the cat in 2 bit indexed RGBA color. Color precision has decreased and each pixel is replaced by the nearest color in the new color palette. On the right: a picture of the cat with the same 2 bit indexed RGBA color, but with dithering enabled. Pixel colors are no longer the nearest color in the new color palette: some noise has been introduced in order to avoid the creation of borders.
It is far more efficient to instantiate one framebuffer and to dynamically bind a texture on it using GL.framebufferTexture2D
each time we need to render to a specific texture than instantiating one framebuffer per texture, to bind each texture to a different framebuffer, and to dynamically switch the framebuffer when we need to render to the bound texture using GL.bindFramebuffer
. Indeed, GL.bindFramebuffer
is very costly in terms of performance.
Rather than drawing two triangles filling the viewport, we can draw only one triangle larger than the viewport (its corners can be [-1,-1]
, [3,-1]
, [-1,3]
). It is slightly faster. The fragment shader is still executed once per pixel thanks to the scissor test, integrated in the WebGL graphic pipeline and enabled by default.
In the beginning, the data are processed by JavaScript, whether it is images, video, or arrays. So they are stored in the RAM and in the CPU registers, and they are handled by the CPU. Then they are sent on the GPU and processed with the shaders. At the end, we will probably need them on the CPU side again. So we need to transfer the data from the CPU to the GPU, then do the opposite.
floating-point textures are often initialized from JavaScript data. For example, if synaptic weights are stored in a texture, we fill it from a JavaScript array of values calculated during a previous training or matching a specific random distribution. If the texture stores GL.FLOAT
elements (using 32 bits per float), the initialization is straightforward:
//small variation between WebGL1 and 2:
var
internalPixelFormat
=
(
ISWEBGL2
)
?
GL
.
RGBA32F
:
GL
.
RGBA
;
GL
.
texImage2D
(
GL
.
TEXTURE_2D
,
0
,
internalPixelFormat
,
<
width
>
,
<
height
>
,
0
,
GL
.
RGBA
,
GL
.
FLOAT
,
<
instance_of_Float32Array
>
);
But the initialization from an array is more difficult if the texture stores GL.HALF_FLOAT
elements, because there is no JavaScript Float16Array
type. On JavaScript side, we have to encode the floating-point values with 16 bits per value using the standard float16 encoding (one bit for the sign, five for the exponent and 10 for the fraction). Then we have to put the encoded values into a JavaScript Uint16Array
. The encoding function from JavaScript Float32Array
to Uint16Array
is provided in the code repository of this book (see RTTfloat
heat simulation). Then we initialize the texture with the Uint16Array
:
//see "continuous simulation" example to see the code
//of the "convert_arrayToUInt16Array" function:
var
u16a
=
convert_arrayToUInt16Array
(
<
instance_of_Float32Array
>
);
var
internalPixelFormat
=
(
ISWEBGL2
)
?
GL
.
RGBA16F
:
GL
.
RGBA
;
GL
.
texImage2D
(
GL
.
TEXTURE_2D
,
0
,
internalPixelFormat
,
<
width
>
,
<
height
>
,
0
,
GL
.
RGBA
,
GL
.
HALF_FLOAT
,
u16a
);
Unless we have a 100% GPU workflow, at some point we need to get back our computation data to JavaScript. This operation is slow and it should be done as little as possible. We have to render into the default framebuffer (displayed on the <canvas>
element), then to read the pixels using the instruction GL.readPixels
. It fills a Uint8Array
JavaScript typed array with the interleaved RGBA values of the pixels of a rectangular area. The slowness comes from the forced synchronization between the CPU and the GPU. Before using it we should create the WebGL context with the option preserveDrawingBuffer: true
.
Reading eight bits encoded values is straightforward: the texture is rendered by a fragment shader, which copies the texture value to gl_FragColor
. Then framebuffer values are read using GL.readPixels
. But for floating-point value textures this is more complicated. We need to develop a specific fragment shader to pack each floating-point value into several eight bits values, then process several renderings with staggered viewports, read the renderings with one GL.readPixels
drawcall and finally reconstruct the floating-point array from the Uint8Array
with JavaScript.
The shader packing floats into eight bits color values is provided in the book repository, with an example of floating-point texture reading.
After this ride into the fascinating world of WebGL GPGPU, we come back to deep learning to build a minimalist WebGL linear algebra library called WGLMatrix
. Then we will use it to implement a simple neural network (not convolutional) learning to recognize handwritten digits from the MNIST dataset. A test of the first version of the linear algebra library is included in chapter3/4_WGLMatrix.
There are built-in matrix types in GLSL only for matrices with largest dimensions less or equal to four. This is not enough for deep learning. So matrices are stored using textures and we need to develop specific shaders for common matrix operations. Each texel is a matrix entry, and the texture has the same resolution than the matrix dimensions.
When a matrix is created, if four JavaScript initialization arrays are provided, the RGBA channels are filled with them. Otherwise, if only one array is provided, its values are duplicated four times to fill the color channels. Each common matrix operation is done separately for the four RGBA channels. It is as if we process it four times in parallel with four different matrices.
The addition fragment shader does not depend on the matrix size like all other elementwise operators. This is its GLSL code:
void
main
(
void
){
vec2
uv
=
gl_FragCoord
.
xy
/
resolution
;
vec4
matAValue
=
texture2D
(
samplerTexture0
,
uv
);
vec4
matBValue
=
texture2D
(
samplerTexture1
,
uv
);
gl_FragColor
=
matAValue
+
matBValue
;
}
Unlike the addition shader, the multiplication shader involves a for
loop to iterate over a row of the first matrix and simultaneously over a column of the second matrix. With WebGL1, it is not possible to use non-constant values among for
conditions. This restriction includes GLSL uniforms values or previous computation results. So, we need to compile one shader program per matrix common multiplication dimension. For example, this shader can be used for all multiplications between a (n, 10)
matrix and a (10, m)
matrix:
//vector between 2 consecutive texels of first factor:
const
vec2
DU
=
vec2
(
1.
/
10.
,
0.
);
//vector between 2 consecutive texels of second factor:
const
vec2
DV
=
vec2
(
0.
,
1.
/
10.
);
void
main
(
void
){
vec2
uv
=
gl_FragCoord
.
xy
/
resolution
;
vec2
uvu
=
uv
*
vec2
(
1.
,
0.
);
vec2
uvv
=
uv
*
vec2
(
0.
,
1.
);
vec4
result
=
vec4
(
0.
,
0.
,
0.
,
0.
);
for
(
float
i
=
0.0
;
i
<
10.0
;
i
+=
1.0
){
result
+=
texture2D
(
samplerTexture0
,
uvv
+
(
i
+
0.5
)
*
DU
)
*
texture2D
(
samplerTexture1
,
uvu
+
(
i
+
0.5
)
*
DV
);
}
gl_FragColor
=
result
;
}
We add 0.5
to i
to pick the middle of the pixel, otherwise we may have a rounding error for some specific matrix dimensions. With WebGL2, non-constant for
loop conditions have been implemented. But this is still not acceptable to only support WebGL2 for commercial applications (the support rate is 41% in March 2018, source: webglstats.com). So we build a WebGL1 shader, which also works for WebGL2.
We often have to simultaneously multiply and add matrices. For example, when we multiply a neuron layer inputs X
by the weights matrix W
, and then we add the biases B
to compute the summed inputs Z
: Z = WX + B
. We should compile a specific shader to do this operation which is called FMA (Fused Multiply–Accumulate). It will save some render to texture passes.
Our matrix library, WGLMatrix
, manages a dictionary of multiplication and FMA shader programs. If we need to process two matrices, which common dimension is included into the dictionary, we just use the shader program from the dictionary. Otherwise a new dimension specific shader program is compiled and added to the dictionary.
A shader is dedicated to the application of the activation function. We should be careful of floating-point specials, especially if the activation function involves exponentials or logarithms. This shader applies a sigmoid activation function:
const
vec4
ONE
=
vec4
(
1.
,
1.
,
1.
,
1.
);
void
main
(
void
)
{
vec2
uv
=
gl_FragCoord
.
xy
/
resolution
;
vec4
x
=
texture2D
(
samplerTexture0
,
uv
);
vec4
y
;
y
=
1.
/
(
ONE
+
exp
(
-
x
));
gl_FragColor
=
y
;
}
Because many specific activation functions may be applied to a matrix, we set a public method in WGLMatrix
to compile custom shaders from outside the library.
Each matrix should be initialized before use. Matrices m,v,n
are initialized from their flattened values:
// encoding 3*3 matrix | 0 1 2 |:
// | 3 4 5 |
// | 6 7 8 |
var
M
=
new
WGLMatrix
.
Matrix
(
3
,
3
,[
0
,
1
,
2
,
3
,
4
,
5
,
6
,
7
,
8
]);
//V is a column matrix, which is a vector:
var
V
=
new
WGLMatrix
.
Matrix
(
3
,
1
,[
1
,
2
,
3
]);
var
W
=
new
WGLMatrix
.
MatrixZero
(
3
,
1
);
Mathematically, there is no difference between a vector and a matrix, which width equals 1. After initialization we can apply operations on matrices. To execute the operation OPERATION
on matrix A
we run:
A
.
OPERATION
(
arguments
...,
R
)
R
is the matrix where the result is put. We cannot store the result of an operation in any matrix involved in it because this is not possible to simultaneously read a texture and render to it. The operation always returns the result matrix R
.
For example, to process the operation W = M*V
run:
M
.
multiply
(
V
,
W
);
And it returns the W
matrix.
We can declare a custom elementwise operation by executing:
WGLMatrix
.
addFunction
(
'y=cos(x);'
,
'COS'
);
The first argument is the GLSL code of the function using transforming pre-defined vec4 x
to vec4 y
. The second argument is the user-defined identifier of the function. Then to apply it to all elements of a matrix M
execute: M.apply('COS', R)
where R
is the matrix receiving the result.
We have added some computation methods to WGLMatrix in order to implement all of the matrix operations required for training and running neural networks. We have also added the verification of dimensions of the matrices in order to prevent invalid operations.
The dataset is loaded on the GPU as input vectors encoding digit images and expected outputs. It uses the overwhelming majority of graphics memory. These data does not need to be stored with 16 or 32 bits precision: eight bits are far enough because input images are already encoded with eight bits per channel and output vectors are binary. We have added eight bits precision support to WGLMatrix library.
The dataset loading is done using the mnist_loader.js
script. At line 50 we add a new [X,Y]
pair either in the training or in the test dataset depending on its index:
targetData
.
push
([
new
WGLMatrix
.
Matrix
(
784
,
1
,
learningInputVector
),
//X
new
WGLMatrix
.
Matrix
(
10
,
1
,
learningOutputVector
)
//Y
]);
After this step the whole dataset is loaded into the graphic memory as textures.
It is important to be able to predict the required graphic memory because if we try to allocate more memory than available, the WebGL context is killed and the application crashes. In our case we load 60000
handwritten digit images, so 60000 * 28 * 28approx47e6
texels. Each texel is stored in 4 RGBA channel and each color channel is encoded using eight bits (= 1 byte ). We need 47e6 * 4 * 1=188e6
bytes for the input vectors, so about 200MB. For the output vectors we need 60000 * 10 * 4 * 1 approx 2.4e6
, so only 2.4MB.
But the GPU does not store the textures in a raw pixel format. Each texel is encoded according to its 2D neighborhood so there are strong edge effects. In our case, instead of using around 200MB of graphic memory, the whole dataset requires almost 4GB of graphic memory.
In this figure you see Nvidia GPUs, where the Nvidia settings panel is a handful to monitor the GPU memory and occupancy rate. After the loading of the whole MNIST dataset you see that 93% of the graphic memory is used.
If we change the shape of input vectors from (784, 1)
to (28, 28)
, their raw memory size is the same because their support textures have the same number of texels ( 784*1 = 28*28
). But this time the whole MNIST dataset occupies only around 280MB of the graphic memory, so more than 10 times less. Indeed, a one pixel wide texture like our (784, 1)
input vectors is not stored efficiently because no texel has a complete 2D neighborhood. The memory occupied by the square textures is still above the theoretical value because the memory is paginated (or divided in blocks which are indivisible) and there are still edge effects since these textures are small.
To optimize GPU texture compression, we should:
If we allocated only a few very large square textures, we would have little edge effects and few partially empty memory blocks. The actual occupied memory size will get closer to the theoretical value. To keep our implementation simple, we do not take account of these improvements and we always use textures which resolution is the same as the dimensions of the encoded matrix.
This is the sigmoid activation function declaration in network.js:
WGLMatrix
.
addFunction
(
'y=1./(ONE+exp(-x));'
,
'ACTIVATION'
);
This is the feedforward part written in network.js:
self
.
feedforward
=
function
(
a
){
//Return the output of the network if ``a`` is input.
for
(
var
i
=
0
,
inp
=
a
;
i
<
self
.
_nConnections
;
++
i
){
//from input to output layer,
// compute WI+B and store the result in _z:
self
.
weights
[
i
].
fma
(
inp
,
self
.
biases
[
i
],
self
.
_z
[
i
]);
//apply actFunc to _z and store result to _y
self
.
_z
[
i
].
apply
(
'ACTIVATION'
,
self
.
_y
[
i
]);
//set input for the next iteration
inp
=
self
.
_y
[
i
];
}
return
inp
;
}
We have transcoded a MNIST classifier written in Python/Numpy to JavaScript/WGLMatrix. The code can be found in chapter3/5_MNIST. This example comes from Chapter 1 of Michael Nielsen’s online book, Neural networks and Deep learning. It is described on: http://neuralnetworksanddeeplearning.com/chap1.html. The neural network of the classifier is shallow. It has only three neuron layers:
It is densely connected and the activation function is the sigmoid function. This network is trained over 30 epochs with eight samples per minibatch and a learning rate of 3.0 (per minibatch). There are 50000 samples in the training dataset and 10000 in the testing dataset. The best success rate on test data is 95.42% at epoch 27.
These are the benchmarks:
Note: matrix operations processed by Numpy are done by lower level linear algebra libraries like BLAS or LAPACK. They use the CPU very efficiently with linear algebra advanced features like SIMD instructions. They are much faster than if they were done by native Python functions.
This benchmark is not very good for our implementation. But we will surpass the Numpy implementation with a few improvements.
You can test the improved version of learning on MNIST dataset on chapter3/6_MNISTimproved.
We do not use the RGBA channels at all in the previous implementation. Operations are uselessly duplicated over the four channels. We will use them separately in order to process four input vectors in parallel. This is doable because the size of a minibatch is a multiple of four (eight in the previous example). For testing data, we pack each four input/output vectors into one input/output vector by multiplexing over RGBA channels. The same execution time for our WebGL implementation drops from 942 seconds to 282 seconds. This is now better than Python/Numpy.
But the GPU utilization rate is only around 30% because textures encoding matrices are too small. Indeed, there is a bottleneck if we render to a texture which size is smaller than the number of GPU computing units: there is not enough work and some computing units stay idle.
So we need to render to larger textures. There are several solutions to overcome this problem:
We consider another network to learn the MNIST dataset:
We train it over 20 epochs with eight samples per minibatch with a learning rate of 1.0. Now the GPU works at 65% on average. Our implementation execution time is 344 seconds whereas the Python/Numpy implementation execution time is 1635 seconds (and the best success rate on test data is 96.45% at epoch 18). So our implementation is almost five times faster. We could still use the GPU more intensively, up to a 100% use rate, by increasing the hidden layers sizes. The performance ratio between our implementation and the Python one would climb higher. But we won’t get better results on test data because of overfitting problems (we should implement regularization or dropout algorithms to solve them, or implement convolutive connectivity).
With a low end GPU (the laptop used for benchmarks has also a low end Intel HD4600 GPU), the same learning takes 587 seconds. This is still almost three times faster than the CPU implementation.
We started this chapter by a minimal implementation of WebGL to draw two triangles to fill the viewport. The learning curve of WebGL is very steep at first and this is the most difficult part. WebGL appears as heavy plumbing where we need to write a lot of code and to create many different objects before rendering. But the more we practice it, the more it seems logical and well conceived. After the first rendering, the learning becomes incremental.
Then we use WebGL to compute in parallel pixel colors forming a beautiful Mandelbrot fractal. The rendering is fast despite each pixel require a quite heavy computation. Subsequently, we master this power to accelerate our calculations. We implement render to texture to get the result of a computation step to use it at a next step instead of directly displaying it. We implement our first WebGL simulation: the Conway game of life.
We tackle the floating-point computation problems. WebGL is designed for low precision computations because the RGB colors values do not need to be encoded with more than eight bits per component. So we need to consider several execution paths depending on hardware capabilities and WebGL version. We apply this new know-how to transform our previous simulation into a thermal diffusion simulation, dealing with physical continuous variable instead of discrete parameters.
We build our own linear algebra library, which can handle all standard floating-point values matrix operations. We put it to the test with the handwritten digit recognition using the MNIST dataset. Finally, we optimize the implementation to make it faster than the Python/Numpy equivalent.
This chapter is useful both to improve the existing WebGL implementations or to build a custom one. Whatever the software overlays used above WebGL, it is not possible to optimize correctly or to add functionalities efficiently without understanding the underlying mechanisms. It can also be a starting point toward the infinite world of real time hardware accelerated rendering.
In the next chapter we will look at how to extract data from the browser, such as loading image data from URLs, parsing the frames from the webcam, or parsing the data from the microphone.
18.188.179.48