Speeding up calculations in .NET with CUDA

For those who have been living under a rock for the last 10 years or so, utilizing the phenomenal parallelism of the graphics processor for some serious number crunching is a big thing these days. For traditional CPU-bound developers, however, the gap between the two worlds is not insignificant. I look forward to the day when standardization of these features has reached the point where they are included in the standard libraries: a tool we simply take out of the toolbox when the job requires it.

In the meantime, excellent work has been done by various groups to bridge that divide. As a .NET developer wanting to get my hands dirty with this exciting technology (albeit somewhat belatedly), the natural choice for me was the excellent CUDAfy.NET framework, licensed under the LGPL. An alternative framework that I have yet to test is Alea GPU. Surprisingly, CUDAfy.NET supports both CUDA and OpenCL, the two major standards today (CUDA being proprietary to NVIDIA).

For my test application I went back to an old favorite of mine: rendering the Mandelbrot set fractal. This fractal is a rewarding thing to parallelize, since the calculations for every pixel are entirely independent and can be split up as many ways as there are pixels to render. I also created a plugin architecture for the rendering application that dynamically loads different visualizations, be they Mandelbrot, Julia or something else. Two sets of visualization plugins were implemented: one for the CPU and one for the GPU.

Another requirement was that when viewing the Mandelbrot set, the Julia set should update instantly in a small picture-in-picture window. This makes sense since the Mandelbrot set is essentially a map of Julia sets: each point on the complex plane corresponds to a unique set of input parameters for the Julia set. To this end I implemented a simple UI in WPF, a framework library, and implementations of the Mandelbrot and Julia fractal plugins for both the CPU and the GPU. Neither the UI nor the framework knows anything about CUDAfy.NET.
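
To give a flavor of the plugin contract, this is roughly what the shared interface might look like, sketched from the implementations shown further down (the actual interface in the project may differ in its details):

public interface IFractal
{
  // Display name and identifiers used by the framework to load and link plugins.
  string Name { get; }
  Guid Id { get; }

  // Links a plugin to its counterpart, e.g. a Mandelbrot plugin to its Julia plugin.
  Guid? LinkedId { get; }

  // Renders the requested region into levels and colors using the supplied palette.
  RegionData Generate(RegionDefinition definition, byte[] palette);

  // The GPU extension method further down also reads a Parameters property,
  // which it flattens to a double[] before copying it to the device.
}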

To test whether a point belongs to the Mandelbrot set you iterate a complex number until it either escapes past a certain limit or reaches a maximum iteration count (initially 256 in my implementation; the actual maximum is determined dynamically in the application based on how deep you have zoomed). A full-screen rendering might thus require 1920 x 1152 x 256 = 566 million iterations in the worst case (if no points escape early), and when zoomed in deeply the work grows with the increased maximum iteration count. It's a testament to the speed of our current generation of computers that this takes at most a few seconds running on one CPU core. Performing the same calculation on the GPU took only 1/20 of that time (1/200 if the calculations were done in single precision floating point instead of double precision).

There are a number of known algorithmic optimizations you can employ when rendering the Mandelbrot fractal, such as recursively testing the edges of rectangular regions: if the entire border of a region is of a uniform color, you fill that region with that color. This is possible because the Mandelbrot set is a connected set (there are no islands), and it allows us to skip a great number of calculations. I will leave that as a possible future improvement.
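
For the curious, here is a rough sketch of how such a subdivision could look on the CPU. This is only an illustration, not part of the plugin implementations; computeLevel(x, y) stands in for a per-pixel escape-time function such as the MSetLevel method shown later.

static void RenderRegion(int[,] levels, int left, int top, int width, int height, Func<int, int, int> computeLevel)
{
  if (width <= 3 || height <= 3)
  {
    // Region too small to be worth subdividing: compute every pixel.
    for (var y = top; y < top + height; y++)
      for (var x = left; x < left + width; x++)
        levels[x, y] = computeLevel(x, y);
    return;
  }

  // Compute the border and check whether it is uniform.
  var borderLevel = computeLevel(left, top);
  var uniform = true;
  for (var x = left; x < left + width; x++)
  {
    levels[x, top] = computeLevel(x, top);
    levels[x, top + height - 1] = computeLevel(x, top + height - 1);
    uniform &= levels[x, top] == borderLevel && levels[x, top + height - 1] == borderLevel;
  }
  for (var y = top; y < top + height; y++)
  {
    levels[left, y] = computeLevel(left, y);
    levels[left + width - 1, y] = computeLevel(left + width - 1, y);
    uniform &= levels[left, y] == borderLevel && levels[left + width - 1, y] == borderLevel;
  }

  if (uniform)
  {
    // Uniform border: fill the interior with the same level and skip its calculations.
    for (var y = top + 1; y < top + height - 1; y++)
      for (var x = left + 1; x < left + width - 1; x++)
        levels[x, y] = borderLevel;
    return;
  }

  // Mixed border: split into four quadrants and recurse.
  var halfWidth = width / 2;
  var halfHeight = height / 2;
  RenderRegion(levels, left, top, halfWidth, halfHeight, computeLevel);
  RenderRegion(levels, left + halfWidth, top, width - halfWidth, halfHeight, computeLevel);
  RenderRegion(levels, left, top + halfHeight, halfWidth, height - halfHeight, computeLevel);
  RenderRegion(levels, left + halfWidth, top + halfHeight, width - halfWidth, height - halfHeight, computeLevel);
}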

Since initializing CUDAfy.NET uses the same code for both the Julia and Mandelbrot sets (and any future fractal plugins implemented for the GPU), I created an extension method that “cudafies” the instance on which it's used.

using System;
using System.IO;
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

public static class CudaExtensions
{
  
  private static void LoadTypeModule(GPGPU gpu, Type typeToCudafy)
  {
    var appFolder = AppDomain.CurrentDomain.BaseDirectory;
    var typeModulePath = Path.Combine(appFolder, typeToCudafy.Name + ".cdfy");
    var cudaModule = CudafyModule.TryDeserialize(typeModulePath);
    if (cudaModule == null || !cudaModule.TryVerifyChecksums())
    {
      cudaModule = CudafyTranslator.Cudafy(new[] { typeToCudafy });
      cudaModule.Serialize();
    }
    gpu.LoadModule(cudaModule, false);
  }

  public static void Execute<T>(this T instance, string kernel, int[] levels, byte[] colors, byte[] palette, RegionDefinition definition) where T: IFractal
  {
    CudafyModes.Target = eGPUType.Cuda;
    CudafyModes.DeviceId = 0;
    CudafyTranslator.Language = CudafyModes.Target == eGPUType.OpenCL ? eLanguage.OpenCL : eLanguage.Cuda;
    var deviceCount = CudafyHost.GetDeviceCount(CudafyModes.Target);
    if (deviceCount == 0) return;
    var gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
    LoadTypeModule(gpu, instance.GetType());
    // Flatten the plugin's parameters into a double array that can be copied to the device.
    var parameters = (instance.Parameters as object).ToDoubleArray();
    var devLevels = gpu.Allocate<int>(levels.Length);
    var devColors = gpu.Allocate<byte>(colors.Length);
    var devPalette = gpu.Allocate<byte>(palette.Length);
    var devParameters = gpu.Allocate<double>(parameters.Length);
    gpu.CopyToDevice(palette, devPalette);
    gpu.CopyToDevice(parameters, devParameters);
    const int gridSide = 128;

    if (definition.Width % gridSide != 0 || definition.Height % gridSide != 0)
    {
      throw new ArgumentException(string.Format("Width and height must be a multiple of {0}", gridSide));
    }

    var blockWidth = definition.Width / gridSide;
    var blockHeight = definition.Height / gridSide;

    gpu.Launch(new dim3(gridSide, gridSide), new dim3(blockWidth, blockHeight), kernel, devLevels, devColors, devPalette, definition.Width, definition.Height, definition.SetLeft,
      definition.SetTop, definition.SetWidth, definition.SetHeight,
      definition.MaxLevels, devParameters);

    gpu.CopyFromDevice(devLevels, levels);
    gpu.CopyFromDevice(devColors, colors);

    gpu.FreeAll();
    gpu.UnloadModules();
  }
}

There are a number of improvements that could be made to the initialization code above, in particular dynamically determining whether to use OpenCL or CUDA, as well as choosing the best grid and block parameters given the capabilities of the graphics system. Right now it assumes you have a fairly high-end NVIDIA graphics card no more than a couple of years old.
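
A first step in that direction could be something along the lines of the sketch below. This is just my assumption of a sensible fallback order (prefer CUDA, fall back to OpenCL, let the caller use the CPU plugin when no device is found), not code from the application.

public static class GpuTargetSelector
{
  // Returns the first available device, preferring CUDA over OpenCL, or null if neither is present.
  public static GPGPU GetBestDevice()
  {
    foreach (var target in new[] { eGPUType.Cuda, eGPUType.OpenCL })
    {
      if (CudafyHost.GetDeviceCount(target) == 0) continue;

      CudafyModes.Target = target;
      CudafyModes.DeviceId = 0;
      CudafyTranslator.Language = target == eGPUType.OpenCL ? eLanguage.OpenCL : eLanguage.Cuda;
      return CudafyHost.GetDevice(target, CudafyModes.DeviceId);
    }
    return null; // no GPU found; fall back to the CPU implementation
  }
}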

The trickiest part of coding for the GPU is that you have to form an inherently parallel mental model. This model is also typically spatial (two- or three-dimensional grids and blocks), which makes perfect sense for graphics but may require a change of perspective to adapt it to a more abstract problem. The focus on number crunching in arrays also means that you may as well throw whatever object model you had in mind to represent the data out the window; although, given how bloated exceedingly object-oriented architectures tend to get, this is more refreshing than anything else.

At the core are the concepts of a grid and of thread blocks within each grid cell. Grids and blocks are usually two-dimensional but can also be three- or even one-dimensional. In my limited testing, however, using one-dimensional grids and blocks resulted in horrible performance, and three-dimensional ones simply didn't make sense given the nature of the problem (2D image rendering). How you apportion the problem into the blocks and over the grid determines the performance you can expect. In my implementation above I have chosen a two-dimensional grid and two-dimensional thread blocks. The overall grid is 128×128 and each grid cell is divided up into thread blocks whose size depends on the size of the image we want to generate. Each kernel instance then has to calculate from its block and thread indices which part of the problem it should work on, i.e. which point and pixel. The fixed grid dimensions simplify things for me at the cost of a limitation on the size of images that can be generated (the sides must be a multiple of 128), but that is nothing a little gentle resizing and cropping in the UI can't solve. Optimizing around grids, blocks and threads would be a blog post of its own, and besides, others have done it better already.
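
To make the numbers concrete, here is the launch geometry for the 1920 x 1152 full-screen case mentioned earlier, matching the calculation in the Execute method above:

// grid:  128 x 128 cells (fixed)
// block: (1920 / 128) x (1152 / 128) = 15 x 9 threads per block
// total: 128 * 128 * 15 * 9 = 2,211,840 threads, i.e. one per pixel
var grid = new dim3(128, 128);
var block = new dim3(1920 / 128, 1152 / 128);

// Inside the kernel each thread then maps its indices to a pixel:
// x = blockDim.x * blockIdx.x + threadIdx.x   (0 .. 1919)
// y = blockDim.y * blockIdx.y + threadIdx.y   (0 .. 1151)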

You may wonder why, in the code above, I allocate memory on the GPU for both the levels and the bitmap data they are mapped to. The reason is that the levels contain information that would otherwise be lost when converting them to colors. Suppose, for example, that we wanted to create a 3D scene out of the fractal landscape; then we would need both the levels and the colors.

For the actual kernel implementation, CUDAfy.NET uses a [Cudafy] attribute that you apply to the methods you wish it to translate to CUDA or OpenCL compatible code (see below). Care must be taken not to do anything in those methods that cannot be compiled by either compiler. Creating a new array (new int[256]), for example, is a big no-no in a kernel unless you use a special method to allocate shared memory within each block (I have yet to test this). In my example I pass references to global memory arrays into the kernel. Another issue is that .NET developers can sometimes become sloppy with data types; an errant numeric literal can cause problems varying in severity depending on the capabilities of the graphics card and SDK you are using.
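
For reference, CUDAfy.NET exposes block-shared memory through GThread.AllocateShared. A minimal sketch of what that could look like follows; as mentioned, I have not tested this myself, so treat it as an illustration only.

[Cudafy]
public static void SharedCopyKernel(GThread thread, int[] input, int[] output)
{
  // Allocate a per-block shared buffer instead of writing "new int[256]",
  // which is not allowed inside a kernel. Assumes blocks of at most 256 threads.
  var cache = thread.AllocateShared<int>("cache", 256);
  var globalIndex = thread.blockDim.x * thread.blockIdx.x + thread.threadIdx.x;

  cache[thread.threadIdx.x] = input[globalIndex];
  thread.SyncThreads(); // wait until every thread in the block has filled its slot

  output[globalIndex] = cache[thread.threadIdx.x];
}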

public class MandelbrotGpu: Mandelbrot
{

  public override string Name => "Mandelbrot (GPU)";

  public override Guid Id => new Guid("ed87ad6e2c984ef0aba5cf00f63b85a2");

  public override Guid? LinkedId => new Guid("bba39b3f89e542cfb13139319c46f10b");

  public override RegionData Generate(RegionDefinition definition, byte[] palette)
  {
    var data = new RegionData(definition);

    this.Execute("MandelbrotKernel", data.Levels, data.Colors, palette, definition);

    return data;
  }

  [Cudafy]
  public static void MandelbrotKernel(GThread thread, int[] levels, byte[] colors, byte[] palette, int w, int h, double sx, double sy, double sw, double sh, int maxLevels, double[] parameters)
  {
    var x = thread.blockDim.x * thread.blockIdx.x + thread.threadIdx.x;
    var y = thread.blockDim.y * thread.blockIdx.y + thread.threadIdx.y;
    var offset = x + y * w;
    var xstep = sw/w;
    var ystep = sh/h;
    var cx = sx + x*xstep;
    var cy = sy + y*ystep;
    var colorOffset = offset * 4;
    var level = MSetLevel(cx, cy, maxLevels);
    levels[offset] = level;

    if (level < maxLevels)
    {
      var paletteOffset = level * 3 % palette.Length;
      colors[colorOffset] = palette[paletteOffset + 2];
      colors[colorOffset + 1] = palette[paletteOffset + 1];
      colors[colorOffset + 2] = palette[paletteOffset];
      colors[colorOffset + 3] = 255;
    }
    else
    {
      colors[colorOffset] = 0;
      colors[colorOffset + 1] = 0;
      colors[colorOffset + 2] = 0;
      colors[colorOffset + 3] = 255;
    }

  }

  [Cudafy]
  public static int MSetLevel(double cr, double ci, int max)
  {
    const double bailout = 4.0;
    var zr = 0.0;
    var zi = 0.0;
    var zrs = 0.0;
    var zis = 0.0;
    var i = 0;
    while (i < max && (zis + zrs) < bailout)
    {
      zi = 2.0 * (zr * zi) + ci;
      zr = (zrs - zis) + cr;
      zis = zi * zi;
      zrs = zr * zr;
      i++;
    }
    return i;
  }
}

Contrast the above GPU implementation to the one for the CPU (they render identical images).

Edit: I decided to give the CPU a fair shot and parallelized the rendering for it too. This resulted in a roughly four-fold performance increase (still at least 5 times slower than the GPU). This is on a single Intel i7 CPU with four cores and hyper-threading. Notice the calculation of MaxDegreeOfParallelism: it's currently set to the number of logical processors times two (8 x 2 = 16), which seems like a reasonable trade-off.

public class MandelbrotCpu: Mandelbrot
{

  public override string Name => "Mandelbrot (CPU)";

  public override Guid Id => new Guid("9fe96fcd474649c6a6be3472ec794336");

  public override Guid? LinkedId => new Guid("c6d79046fed34ac9a800908b193218ad");

  public override RegionData Generate(RegionDefinition definition, byte[] palette)
  {
    var data = new RegionData(definition);
    var w = definition.Width;
    var h = definition.Height;
    var pixels = w * h;

    var sx = definition.SetWidth / w;
    var sy = definition.SetHeight / h;

    var degree = Environment.ProcessorCount * 2;

    Parallel.For(0, pixels, new ParallelOptions { MaxDegreeOfParallelism = degree }, index =>
    {
      var i = index % w;
      var j = (index - i) / w;
      var x = definition.SetLeft + i * sx;
      var y = definition.SetTop + j * sy;

      // MSetLevel both computes the escape level and writes the pixel's color into data.Colors.
      var level = MSetLevel(x, y, definition.MaxLevels, data.Colors, index * 4, palette);
      data.SetLevel(index, level);
    });

    return data;
  }

  private static int MSetLevel(double cr, double ci, int max, byte[] colors, int colorIndex, byte[] palette)
  {
    const double bailout = 4.0;
    var zr = 0.0;
    var zi = 0.0;
    var zrs = 0.0;
    var zis = 0.0;
    var i = 0;
     
    while(i < max && (zis + zrs) < bailout)
    {
      zi = 2.0*(zr*zi) + ci;
      zr = (zrs - zis) + cr;
      zis = zi * zi;
      zrs = zr * zr;
      i++;
    }

    if (i < max)
    { 
      var paletteIndex = i*3%palette.Length;
      colors[colorIndex] = palette[paletteIndex+2];
      colors[colorIndex+1] = palette[paletteIndex+1];
      colors[colorIndex+2] = palette[paletteIndex];
      colors[colorIndex + 3] = 255;
    }
    else
    {
      colors[colorIndex] = 0;
      colors[colorIndex + 1] = 0;
      colors[colorIndex + 2] = 0;
      colors[colorIndex + 3] = 255;
    }

    return i;
  }
}

There is of course more to the implementation than this – for one thing, I might put the source up on GitHub if there is enough interest – but it should hopefully be enough to whet your appetite and get you started.

Edit: The source code for this article is now up on GitHub.

The CUDAfy.NET distribution comes with some excellent example applications that will help you on the way as well.

Going forward I’m planning to look into optimizing the grid, block and thread allocations further, to use shared memory to minimize access to global memory, and to experiment further with CUDA for simulation and machine learning.