Sunday, November 28, 2010

OpenCL, first iteration

OpenCL is a general purpose computing framework. It is designed for parallel computing so it runs very well in modern GPUs, which are parallel by nature. The whole idea is to have a single "job" executed by a large number of threads at the same time.

Voxels are a natural fit for OpenCL. Many voxel algrorithms run locally, meaning that the algorithm only needs to look at one or a few voxels at a time. There are no dependencies between a voxel and other voxels far away.

I already had an implementation of Dual Contouring for the CPU which would be a good start for a port to OpenCL. A straight port was out of the question, almost in every case OpenCL will require you to re-think your problem from a new angle. Still, having the traditional CPU implementation gave me a good idea of the issues I needed to address.

This first iteration was only to test the waters. I wanted to see if a naive approach would produce acceptable results. The bare minimum I could do was:

  1. For each point in the voxel grid, evaluate the 3D field function
  2. For each edge in the grid, see if the edge crossed the surface and compute the intersection point
  3. For each voxel in the grid, see if its edges had an intersection and compute the inner point for the voxel

I could then create the resulting mesh in the CPU. I would need to iterate through all the edges again, if an edge was crossed, I would output a quadrilateral containing the inner points for the four neighboring voxels.

From the start I saw this would require to run three different "jobs" or "kernels" as they are called in OpenCL-speak. One to evaluate the points in the grid, another for intersecting edges, and one last for computing inner points.

A kernel is like a traditional C function that takes multiple parameters. Some parameters are inputs, some are outputs. In most cases, arrays are used to move data in and out.

For instance, the first kernel for computing the grid function would take an array of floats as an input. Then it would compute one single value for the function and put it into the array, making sure the value was put in the correct index. Why only one value? This is where the parallel way of thinking comes in: this same kernel will run for each point of the array, actually hundreds of them may be running at the same time. OpenCL will spawn as many instances of your kernel as necessary so the entire array can be covered. 

It did not take long to implement this version. The results were good, but it was only a few times faster than the CPU implementation. I knew this was not good enough. So what was the problem with the naive approach? It was not difficult to see it was wasting a lot of processing power.

The CPU implementation, being serial by nature, had the chance to skip further processing edges, voxels and points when it realized they were not crossed by the surface. For instance, it would compute the inner point of a voxel only if any of the voxel edges was intersected by the surface. If there was no intersection, the next CPU cycles would be spent processing the next voxel. 

The key in the sentences before is the word "If".

With OpenCL and parallel computing in general, "if" statements won't save you any processing, unless you can guarantee the same outcome for every time the if statement is evaluated. Since every thread is running the same code in parallel, it saves very little when you realize that no further processing is needed and the function can just bail out. It is already too late because a thread was allocated to the task anyway. 

The naive approach was doing an awful amount of work that was not needed at all. If I wanted really complex field functions with many octaves of Perlin noise, or distance fields to point clouds for rocks and vegetation, I would have to do better. 

I was also solving Quadratic Error Functions to compute the location of the points inside the voxels. This required normal vectors for each intersection point. The way to compute these normals is by using the gradient of the 3D field in the point, which meant evaluating the field function another four times. The amount of computation was staggering. I clearly needed to optimize the inputs of my OpenCL kernels, so they would look only to the elements that were contributing to the surface.

This realization took me to the second iteration using OpenCL, which will be the subject of my next post.





No comments:

Post a Comment