General purpose GPU computations the old-fashioned way.
1st November 2024
Graphics processing units (GPUs) map geometry, lighting, and other data into pixels on screen.
This naturally involves quite a bit of maths, and is on paper a tough problem. An \(x\times y\) image has \(x\times y \times 4\) float values representing
RGBA colour data. Generating these images is \(O(n^2)\) naively
(leaving out the number of objects in a scene from the scaling), and it has to be done
many times a second. GPUs solve this problem by being specifically designed to compute these colour
data in parallel. That is, for a big enough GPU (or small enough image), in theory the problem is now \(O(1)\).
Actually computing the colours will of course scale with the number of objects in the scene, and with the use of
any physically based rendering. But for the simplest problem of writing varying colour values to
a texture and displaying them, the work is embarrassingly parallel.
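To put a number on it, a single 1080p frame alone is
$$\large{1920\times 1080\times 4 \approx 8.3\times 10^{6}}$$
float values, and at 60 frames per second that is roughly \(5\times 10^{8}\) values to produce every second.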
If GPUs can be used to process images in this way, why not other things? This question gives rise to general
purpose GPU computation (GPGPU). If a colour in a texture instead represents the position of a particle,
RGB \(\equiv (x, y, z)\), then suddenly all that machinery for mapping "geometry, lighting, and other data into pixels on screen"
can instead be used to map particle positions into new ones by applying e.g. Newton's laws. This is the basis
for GPU-side particle effects. The process can involve any numerical data and any equations:
physics, statistics, whatever. This is the basis for GPU usage in deep learning, large language models and the like,
but also in science generally.
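As a flavour of what that looks like, here is a minimal sketch of a fragment shader that treats each texel as one particle, with RGB holding its position, and advances it by a forward Euler step. The sampler and uniform names (positions, velocities, dt) are purely illustrative, and the matching velocity update would be a second pass of the same kind.
#version 330
precision highp float;
in vec2 o_texCoords;
layout(location = 0) out vec4 newPosition;
// each texel holds one particle: RGB = (x, y, z) position.
uniform highp sampler2D positions;
uniform highp sampler2D velocities;
uniform float dt;
void main(){
    vec3 p = texture(positions, o_texCoords).rgb;
    vec3 v = texture(velocities, o_texCoords).rgb;
    // forward Euler step; a second pass would update the velocity
    // texture from the forces (Newton's second law).
    newPosition = vec4(p + dt*v, 1.0);
}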
Nowadays we have vendor languages like CUDA from NVIDIA and ROCm from AMD that target
general purpose programming rather than graphics programming. There are also open standards: OpenGL and Vulkan, which include compute shaders,
or OpenCL and OpenACC, which are designed for general computations. Back in the day though, anyone wanting to
use a GPU to accelerate their calculations generally used something like OpenGL (a graphics API)
to "hack" the general purpose computations. This really took off in the late 90s and early 2000s when OpenGL
and the like moved towards flexible shader programming with floating point textures [1, 2]. A lot of the effort went
into representing big (often sparse) matrices on GPUs using textures, and into developing efficient
algorithms within the limited GPU instruction sets. Which, surprise surprise, is the workhorse of
LLMs and "AI" generally too.
Despite the advent of compute-specific languages we can still use good old vertex and fragment shaders to
perform computations in OpenGL. Simply put, we pack numerical data into a float texture, set up rendering
for a 1-to-1 mapping of fragment to texel (say with a quad and a viewport the size of the texture), use a pass-through
vertex shader, and perform the computations in the fragment shader, "sampling" textures as array accesses. Let's
play with an example.
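A small note on that "sampling as array access": with nearest filtering a float texture behaves like a 2D array, and GLSL also offers texelFetch for direct integer indexing. A minimal illustrative fragment shader, where the data and n names are assumptions for the example:
#version 330
precision highp float;
in vec2 o_texCoords;
layout(location = 0) out float value;
uniform highp sampler2D data;
uniform int n;
void main(){
    // recover the integer indices (i, j) of this fragment's texel.
    int i = int(o_texCoords.x*float(n));
    int j = int(o_texCoords.y*float(n));
    // two equivalent reads of element (i, j) of an n-by-n float texture:
    float a = texelFetch(data, ivec2(i, j), 0).r;            // integer indices, no filtering
    float b = texture(data, (vec2(i, j) + 0.5)/float(n)).r;  // normalised coordinates, texel centre
    value = 0.5*(a + b); // a and b read the same texel
}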
The Kuramoto model [3] is a simple model of networked interacting oscillators. Meaning spinny things that
influence each other's spinning, but not all need to interact with all. Mathematically we have a matrix
\(K_{ij}\) that records whether spinner \(j\) is connected to spinner \(i\) and how influential \(j\) is on \(i\).
Each spinner normally spins at the frequency \(\omega_i\), and we write the model as,
$$ \large{\frac{d\theta_i}{dt} = \omega_i + \sum_{j\neq i} K_{ij} I(\theta_j-\theta_i)}.$$
Where the function \(I\) is the interaction kernel. This is typically a truncated Fourier series,
$$\large{I(\phi) = \sum_{k=1}^{K} a_k\sin(k\phi+b_k)}.$$
This is easy to put on a GPU. We have a single texture for the \(\theta_i\), we access it in a fragment shader
and apply the update rule, then copy/swap the output texture to make the new \(\theta_i\). This can be expanded to
include many different parameters and randomness, but the simplest iteration of the shaders is,
#version 330
precision highp float;
precision highp int;
// xy holds the clip-space position of the quad, zw its texture coordinates.
layout(location = 0) in vec4 a_position;
out vec2 o_texCoords;
void main(){
    // pass the position through and forward the texture coordinates.
    gl_Position = vec4(a_position.xy, 0.0, 1.0);
    o_texCoords = a_position.zw;
}
#version 330
precision highp float;
precision highp int;
in vec2 o_texCoords;
layout(location=0) out float result;
uniform highp sampler2D theta;
// number of elements, and neighbours to interact with.
uniform int n; uniform int s;
// model parameters: strength of interaction, timestep and constant \omega.
uniform float k; uniform float dt; uniform float omega;
// coefficients and shifts for I.
uniform vec4 coef; uniform vec4 shift;
// the interaction kernel I(phi); its definition is omitted here (see below).
float kernel(float phi, vec4 coef, vec4 shift);
void main(){
    // get the ij indices of this spinner (remember this is all in parallel.)
    int i = int(o_texCoords.x*float(n)); int j = int(o_texCoords.y*float(n));
    // get the current angle for ij.
    float ijtheta = texture(theta, o_texCoords).r; float dtheta = 0.0; float count = 1.0;
    // loop over all neighbours up to the s-th encasing shell.
    for (int ix = -s; ix <= s; ix++){
        for(int iy = -s; iy <= s; iy++){
            // skip self-interaction, the sum runs over j != i.
            if (ix == 0 && iy == 0){ continue; }
            // periodic bounds; mod wraps into [0, n-1], the clamp is just a safety net.
            int ni = int(mod(float(ix+i), float(n))); int nj = int(mod(float(iy+j), float(n)));
            ni = clamp(ni, 0, n-1); nj = clamp(nj, 0, n-1);
            // +0.5 to sample texel centres.
            vec2 ntexCoords = (vec2(float(ni), float(nj))+0.5)/float(n);
            // get the neighbour's angle relative to ij.
            float phi = texture(theta, ntexCoords).r-ijtheta;
            // accumulate the interaction.
            dtheta += k*kernel(phi, coef, shift);
            count += 1.0;
        }
    }
    // update ij!
    result = ijtheta+dt*(omega+dtheta/count);
}
Here the kernel function has been omitted (it is truncated to the fourth term using the uniforms coef and shift; a sketch of it is given below).
We are also taking the connectivity to be uniform over all neighbours up to the \(s\)-th encasing shell (with periodic bounds). For \(s=1\) this means
the 8 neighbours encasing one spinner. This gives us a "basic Kuramoto" model.
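For completeness, a minimal sketch of what that omitted kernel could look like, implementing the truncated Fourier series for \(I\) above with the four coefficients and phase shifts packed into the coef and shift uniforms:
// interaction kernel I(phi) truncated to four terms,
// I(phi) = sum_k coef[k]*sin((k+1)*phi + shift[k]).
float kernel(float phi, vec4 coef, vec4 shift){
    float f = 0.0;
    for (int q = 0; q < 4; q++){
        f += coef[q]*sin(float(q+1)*phi+shift[q]);
    }
    return f;
}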
We can go a lot further by introducing a static noise texture (a seed) and using pseudo-random numbers sampled from it. Using this we can have random
connection strengths or randomly "dropped" connections, maybe even based on distance, etc. For example, we could scale each neighbour's coupling by a random strength read from the seed texture, as sketched below.
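A minimal sketch of that idea, to be merged into the fragment shader above; the seed sampler and the per-neighbour indexing are assumptions for illustration (a full random \(K_{ij}\) would need a pair-indexed texture):
// assumed extra uniform: an n-by-n texture of uniform random values in [0, 1).
uniform highp sampler2D seed;
// random coupling strength associated with the neighbour (ni, nj);
// indexing only by the neighbour keeps the sketch simple.
float strengthFor(int ni, int nj, int n){
    return texture(seed, (vec2(float(ni), float(nj))+0.5)/float(n)).r;
}
// ...and in the neighbour loop the accumulation becomes:
//     dtheta += k*strengthFor(ni, nj, n)*kernel(phi, coef, shift);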
Varying all of these parameters we can get some strange patterns!
One actual advantage of doing it this way is that it gives GPGPU on (probably) the widest range of platforms: older devices only need to support floating point textures with OpenGL 3, and
it works without being vendor specific like CUDA or ROCm. But it is hacky! You can also use compute shaders in OpenGL 4. However,
if you like hacky approaches, check out the code for the Kuramoto Lamp.
[1] Krüger, J. and Westermann, R., 2005. Linear algebra operators for GPU implementation of numerical algorithms. In ACM SIGGRAPH 2005 Courses (pp. 234-es). https://dl.acm.org/doi/10.1145/882262.882363
[2] Kass, M. and Miller, G., 1990, September. Rapid, stable fluid dynamics for computer graphics. In Proceedings of the 17th annual conference on Computer graphics and interactive techniques (pp. 49-57). https://dl.acm.org/doi/10.1145/97880.97884
[3] Kuramoto, Y., 1975. International symposium on mathematical problems in theoretical physics. Lecture notes in Physics, 30, p.420. https://link.springer.com/chapter/10.1007/BFb0013365