Speeding up calculations in .NET with CUDA

For those who have been living under a rock for the last 10 years or so, utilizing the phenomenal parallelism of the graphics processor for some serious number crunching is a big thing these days. For traditional CPU-bound developers, however, the gap between the two worlds is not insignificant. I look forward to the day when standardization of these features has reached the point where they are included in the standard libraries, a tool we simply take out of the toolbox when the job requires it.

In the meantime there has been excellent work done by various groups to bridge that divide. As a .NET developer wanting to get my hands dirty with this exciting technology (albeit somewhat belatedly), the natural choice was the excellent CUDAfy.NET framework, licensed under the LGPL. An alternative framework that I have yet to test is Alea GPU. Despite its name, CUDAfy.NET supports both CUDA and OpenCL, the two major standards today (CUDA being proprietary to NVIDIA).

For my test application I went back to an old favorite of mine: rendering the Mandelbrot set fractal. This fractal is an ideal task to parallelize since the calculations for every pixel are entirely independent and can be split up as many ways as there are pixels to render. I also created a plugin architecture for the rendering application that dynamically loads different visualizations, be they Mandelbrot, Julia or something else. Two sets of visualization plugins were implemented, one for the CPU and one for the GPU.

Another requirement was that when viewing the Mandelbrot set, the Julia set should update instantly in a small PiP window. This makes sense since the Mandelbrot set is essentially a map of Julia sets: each point on the complex plane corresponds to a unique set of input parameters for the Julia set. To this end I implemented a simple UI in WPF, a framework library, and implementations of the Mandelbrot and Julia fractal plugins both for the CPU and for the GPU. Neither the UI nor the framework knows anything about CUDAfy.NET.

To test whether a point belongs to the Mandelbrot set you iterate a complex number variable until it either escapes past a certain limit or reaches a maximum iteration count (initially 256 in my implementation). The actual maximum is determined dynamically in the application by the zoom depth you find yourself at. A full screen rendering might thus require 1920 x 1152 x 256 = 566 million iterations in the worst case (if no points escape early). When zoomed in deeply, the number of iterations necessary grows with the increased maximum iteration count. It's a testament to the speed of our current generation of computers that this takes at most a few seconds running on one CPU core. Performing the same calculations on the GPU took only 1/20 of that time (1/200 if the calculations were done in single precision instead of double precision).
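The escape-time iteration described above can be sketched as follows (an illustrative sketch in JavaScript for brevity, not the plugin code itself; the same core loop also serves the Julia PiP window, with the roles of the pixel and the constant swapped):

```javascript
// Escape-time iteration for z -> z^2 + c. Returns how many iterations it
// took |z| to escape past 2 (i.e. |z|^2 past 4), capped at max.
function escapeTime(zr, zi, cr, ci, max) {
  let i = 0;
  while (i < max && zr * zr + zi * zi < 4.0) {
    const t = zr * zr - zi * zi + cr; // real part of z^2 + c
    zi = 2 * zr * zi + ci;            // imaginary part of z^2 + c
    zr = t;
    i++;
  }
  return i;
}

// Mandelbrot: c is the pixel, z starts at the origin.
const mandelbrotLevel = (cr, ci, max) => escapeTime(0, 0, cr, ci, max);
// Julia: c is fixed (e.g. the point under the cursor), z is the pixel.
const juliaLevel = (zr, zi, cr, ci, max) => escapeTime(zr, zi, cr, ci, max);
```

That the Julia variant reuses the loop unchanged is exactly why the two plugins can share so much code.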

There are a number of known algorithmic optimizations you can employ when rendering the Mandelbrot fractal, such as recursively testing the edges of rectangular regions. If the entire border is of a uniform color then you fill that region with that color. This is possible because the Mandelbrot set is a connected set (there are no islands) and allows us to skip a great number of calculations. I will leave that as a possible future improvement.
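As a sketch of how that optimization might look (my own illustration, not part of the app): compute the levels along a rectangle's border, fill the interior if the border is uniform, and otherwise subdivide into quadrants and recurse.

```javascript
// Border-tracing subdivision. "level" is the per-pixel escape-time function;
// "out" is a preallocated w*h array; (x0,y0)-(x1,y1) is an inclusive rectangle.
function renderRegion(level, out, w, x0, y0, x1, y1) {
  // Compute levels along the border and note whether they are all equal.
  let uniform = true;
  const first = level(x0, y0);
  for (let x = x0; x <= x1; x++)
    for (const y of [y0, y1]) {
      const v = level(x, y);
      out[y * w + x] = v;
      if (v !== first) uniform = false;
    }
  for (let y = y0; y <= y1; y++)
    for (const x of [x0, x1]) {
      const v = level(x, y);
      out[y * w + x] = v;
      if (v !== first) uniform = false;
    }
  if (x1 - x0 < 3 || y1 - y0 < 3) {
    // Tiny region: just compute the interior directly.
    for (let y = y0 + 1; y < y1; y++)
      for (let x = x0 + 1; x < x1; x++) out[y * w + x] = level(x, y);
  } else if (uniform) {
    // Entire border has one level: fill the interior without iterating.
    for (let y = y0 + 1; y < y1; y++)
      for (let x = x0 + 1; x < x1; x++) out[y * w + x] = first;
  } else {
    // Split into four (overlapping at the midlines) quadrants and recurse.
    const mx = (x0 + x1) >> 1, my = (y0 + y1) >> 1;
    renderRegion(level, out, w, x0, y0, mx, my);
    renderRegion(level, out, w, mx, y0, x1, my);
    renderRegion(level, out, w, x0, my, mx, y1);
    renderRegion(level, out, w, mx, my, x1, y1);
  }
}
```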

Since initializing CUDAfy.NET involves the same code for both the Julia and Mandelbrot sets (and any future fractal plugins implemented for the GPU) I created an extension method that “cudafies” the instance on which it's used.

public static class CudaExtensions
{
  
  private static void LoadTypeModule(GPGPU gpu, Type typeToCudafy)
  {
    var appFolder = AppDomain.CurrentDomain.BaseDirectory;
    var typeModulePath = Path.Combine(appFolder, typeToCudafy.Name + ".cdfy");
    var cudaModule = CudafyModule.TryDeserialize(typeModulePath);
    if (cudaModule == null || !cudaModule.TryVerifyChecksums())
    {
      cudaModule = CudafyTranslator.Cudafy(new[] { typeToCudafy });
      cudaModule.Serialize();
    }
    gpu.LoadModule(cudaModule, false);
  }

  public static void Execute<T>(this T instance, string kernel, int[] levels, byte[] colors, byte[] palette, RegionDefinition definition) where T: IFractal
  {
    CudafyModes.Target = eGPUType.Cuda;
    CudafyModes.DeviceId = 0;
    CudafyTranslator.Language = CudafyModes.Target == eGPUType.OpenCL ? eLanguage.OpenCL : eLanguage.Cuda;
    var deviceCount = CudafyHost.GetDeviceCount(CudafyModes.Target);
    if (deviceCount == 0) return;
    var gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
    LoadTypeModule(gpu, instance.GetType());
    var parameters = (instance.Parameters as object).ToDoubleArray();
    var devLevels = gpu.Allocate<int>(levels.Length);
    var devColors = gpu.Allocate<byte>(colors.Length);
    var devPalette = gpu.Allocate<byte>(palette.Length);
    var devParameters = gpu.Allocate<double>(parameters.Length);
    gpu.CopyToDevice(palette, devPalette);
    gpu.CopyToDevice(parameters, devParameters);
    const int gridSide = 128;

    if (definition.Width % gridSide != 0 || definition.Height % gridSide != 0)
    {
      throw new ArgumentException(string.Format("Width and height must be a multiple of {0}", gridSide));
    }

    var blockWidth = definition.Width / gridSide;
    var blockHeight = definition.Height / gridSide;

    gpu.Launch(new dim3(gridSide, gridSide), new dim3(blockWidth, blockHeight), kernel, devLevels, devColors, devPalette, definition.Width, definition.Height, definition.SetLeft,
      definition.SetTop, definition.SetWidth, definition.SetHeight,
      definition.MaxLevels, devParameters);

    gpu.CopyFromDevice(devLevels, levels);
    gpu.CopyFromDevice(devColors, colors);

    gpu.FreeAll();
    gpu.UnloadModules();
  }
}

There are a number of improvements that could be made to the initialization code above, especially determining dynamically whether to use OpenCL or CUDA, as well as choosing the best grid and block parameters given the capabilities of the graphics system. Right now it assumes you have a fairly high-end NVIDIA graphics card no more than a couple of years old.

The trickiest part of coding for the GPU is that you have to form an inherently parallel mental model. This model is also typically spatial (two or three dimensional grids and blocks), which makes perfect sense for graphics but may require a change of perspective when adapting it to a more abstract problem. The focus on number crunching in arrays also means that you may as well throw out the window whatever object model you had in mind for representing the data; although, given how bloated exceedingly object oriented architectures tend to get, this is more refreshing than anything else.

At the core are the concepts of a grid and the thread blocks within each grid cell. Grids and blocks are usually two dimensional but can also be three or even one dimensional. In my limited testing, however, using one dimensional grids and blocks resulted in horrible performance, and three dimensional ones simply didn't make sense given the nature of the problem (2D image rendering). How you apportion the problem into the blocks and over the grid determines the performance you can expect. In my implementation above I have chosen a two dimensional grid and two dimensional thread blocks. The overall grid is 128×128, and each grid cell holds a thread block whose dimensions depend on the size of the image we want to generate. Each kernel instance then has to calculate from its block and thread indices which part of the problem it should work on, i.e. which point and pixel. The fixed grid dimensions simplify things for me but introduce a limitation on the size of images that can be generated (the sides must be a multiple of 128), though it's nothing that can't be solved with some gentle resizing and cropping in the UI. Optimizing around grids, blocks and threads would be a blog post of its own, and besides, others have done it better already.

You may wonder why I allocate memory on the GPU above for both the levels and the bitmap data they are mapped to. The reason is that the levels contain information that would otherwise be lost in converting them to colors. Suppose we want to create a 3D scene out of the fractal landscape; then we would need both the levels and the colors.
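To make that concrete, here is a small sketch (illustrative only, not part of the app) that turns the raw level buffer into height-field vertices for such a 3D view; the palette mapping is lossy, so the colors alone could not support this:

```javascript
// Build [x, y, z] vertices from a w*h level buffer, scaling the
// iteration count into an elevation.
function levelsToVertices(levels, w, h, heightScale) {
  const vertices = [];
  for (let y = 0; y < h; y++)
    for (let x = 0; x < w; x++)
      vertices.push([x, y, levels[y * w + x] * heightScale]);
  return vertices;
}
```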

For the actual kernel implementation CUDAfy.NET uses a [Cudafy] attribute that you apply to the methods you wish it to translate to CUDA or OpenCL compatible code (see below). Care must be taken not to do anything in those methods that cannot be compiled by either compiler. Creating a new array (new int[256]), for example, is a big no-no in a kernel unless you use a special method to allocate shared memory within each block (I have yet to test this). In my example I pass references to global memory arrays into the kernel. Another issue is that .NET developers can sometimes be sloppy with data types. An errant numeric literal can cause problems of varying severity depending on the capabilities of the graphics card and SDK you are using.

public class MandelbrotGpu: Mandelbrot
{

  public override string Name => "Mandelbrot (GPU)";

  public override Guid Id => new Guid("ed87ad6e2c984ef0aba5cf00f63b85a2");

  public override Guid? LinkedId => new Guid("bba39b3f89e542cfb13139319c46f10b");

  public override RegionData Generate(RegionDefinition definition, byte[] palette)
  {
    var data = new RegionData(definition);

    this.Execute("MandelbrotKernel", data.Levels, data.Colors, palette, definition);

    return data;
  }

  [Cudafy]
  public static void MandelbrotKernel(GThread thread, int[] levels, byte[] colors, byte[] palette, int w, int h, double sx, double sy, double sw, double sh, int maxLevels, double[] parameters)
  {
    var x = thread.blockDim.x * thread.blockIdx.x + thread.threadIdx.x;
    var y = thread.blockDim.y * thread.blockIdx.y + thread.threadIdx.y;
    var offset = x + y * w;
    var xstep = sw/w;
    var ystep = sh/h;
    var cx = sx + x*xstep;
    var cy = sy + y*ystep;
    var colorOffset = offset * 4;
    var level = MSetLevel(cx, cy, maxLevels);
    levels[offset] = level;

    if (level < maxLevels)
    {
      var paletteOffset = level * 3 % palette.Length;
      colors[colorOffset] = palette[paletteOffset + 2];
      colors[colorOffset + 1] = palette[paletteOffset + 1];
      colors[colorOffset + 2] = palette[paletteOffset];
      colors[colorOffset + 3] = 255;
    }
    else
    {
      colors[colorOffset] = 0;
      colors[colorOffset + 1] = 0;
      colors[colorOffset + 2] = 0;
      colors[colorOffset + 3] = 255;
    }

  }

  [Cudafy]
  public static int MSetLevel(double cr, double ci, int max)
  {
    const double bailout = 4.0;
    var zr = 0.0;
    var zi = 0.0;
    var zrs = 0.0;
    var zis = 0.0;
    var i = 0;
    while (i < max && (zis + zrs) < bailout)
    {
      zi = 2.0 * (zr * zi) + ci;
      zr = (zrs - zis) + cr;
      zis = zi * zi;
      zrs = zr * zr;
      i++;
    }
    return i;
  }
}

Contrast the above GPU implementation to the one for the CPU (they render identical images).

Edit: I decided to give the CPU a fair shot and parallelized the rendering for it too. This resulted in a roughly fourfold performance increase (still at least 5 times slower than the GPU). This is on a single Intel i7 CPU with four cores and hyperthreading. Notice the calculation of the MaxDegreeOfParallelism. It's currently set to the number of logical processors times two (8 x 2 = 16), which seems like a reasonable trade-off.

public class MandelbrotCpu: Mandelbrot
{

  public override string Name => "Mandelbrot (CPU)";

  public override Guid Id => new Guid("9fe96fcd474649c6a6be3472ec794336");

  public override Guid? LinkedId => new Guid("c6d79046fed34ac9a800908b193218ad");

  public override RegionData Generate(RegionDefinition definition, byte[] palette)
  {
    var data = new RegionData(definition);
    var w = definition.Width;
    var h = definition.Height;
    var pixels = w * h;

    var sx = definition.SetWidth / w;
    var sy = definition.SetHeight / h;

    var degree = Environment.ProcessorCount * 2;

    Parallel.For(0, pixels, new ParallelOptions { MaxDegreeOfParallelism = degree }, index =>
    {
      var i = index % w;
      var j = (index - i) / w;
      var x = definition.SetLeft + i * sx;
      var y = definition.SetTop + j * sy;

      // MSetLevel writes the pixel colors as a side effect,
      // so only the level needs to be recorded here.
      var level = MSetLevel(x, y, definition.MaxLevels, data.Colors, index * 4, palette);
      data.SetLevel(index, level);
    });

    return data;
  }

  private static int MSetLevel(double cr, double ci, int max, byte[] colors, int colorIndex, byte[] palette)
  {
    const double bailout = 4.0;
    var zr = 0.0;
    var zi = 0.0;
    var zrs = 0.0;
    var zis = 0.0;
    var i = 0;
     
    while(i < max && (zis + zrs) < bailout)
    {
      zi = 2.0*(zr*zi) + ci;
      zr = (zrs - zis) + cr;
      zis = zi * zi;
      zrs = zr * zr;
      i++;
    }

    if (i < max)
    { 
      var paletteIndex = i*3%palette.Length;
      colors[colorIndex] = palette[paletteIndex+2];
      colors[colorIndex+1] = palette[paletteIndex+1];
      colors[colorIndex+2] = palette[paletteIndex];
      colors[colorIndex + 3] = 255;
    }
    else
    {
      colors[colorIndex] = 0;
      colors[colorIndex + 1] = 0;
      colors[colorIndex + 2] = 0;
      colors[colorIndex + 3] = 255;
    }

    return i;
  }
}

There is of course more to the implementation than this – for one, I might put the source up on GitHub if there is enough interest – but it should hopefully be enough to whet your appetite and get you started. The CUDAfy.NET distribution comes with some excellent example applications that will help you on the way as well.

Going forward I’m planning to optimize the grid, block and thread allocations further, to use shared memory to minimize access to global memory, and to experiment further with CUDA for simulation and machine learning.

Calcium: a simple calculator for infix and rpn expressions.

Just a quick update on a small side project that I recently completed.

It started out as a test driven development code kata which involved implementing the classic shunting-yard algorithm by Edsger Dijkstra as well as a basic RPN evaluator. Additions to the basic algorithms include support for named variables and constants, and conversion of infix expressions to RPN.

The end result is a simple calculator for the windows desktop, written in WPF/C# for the .NET 4 platform, which allows you to input complete mathematical expressions in either infix or postfix (rpn) notation. It is designed to be easy to use by minimizing the use of settings and mode toggles. It will for example automatically detect if what you have entered is in infix or postfix notation.
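To illustrate the evaluator half of that design, here is a minimal RPN evaluator (a sketch of the general technique in JavaScript, not Calcium's actual C# code, and without the variables and constants mentioned above):

```javascript
// Evaluate a whitespace-separated RPN expression, e.g. "3 4 +".
function evalRpn(expression) {
  const ops = new Map([
    ['+', (a, b) => a + b],
    ['-', (a, b) => a - b],
    ['*', (a, b) => a * b],
    ['/', (a, b) => a / b],
  ]);
  const stack = [];
  for (const token of expression.trim().split(/\s+/)) {
    if (ops.has(token)) {
      // Operators pop two operands and push the result.
      const b = stack.pop();
      const a = stack.pop();
      if (a === undefined || b === undefined)
        throw new Error('Malformed RPN expression');
      stack.push(ops.get(token)(a, b));
    } else {
      const value = Number(token);
      if (Number.isNaN(value)) throw new Error('Unknown token: ' + token);
      stack.push(value);
    }
  }
  if (stack.length !== 1) throw new Error('Malformed RPN expression');
  return stack[0];
}
```

For example, `evalRpn('5 1 2 + 4 * + 3 -')` evaluates 5 + (1 + 2) * 4 - 3.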

Calcium

Calcium, a simple infix and rpn calculator.

As with all my utility side projects, the main goal was to create something that I would like to use myself. The built-in desktop calculator has annoyed me for some time, while other, more capable calculators I have tried have seemed like overkill for my needs.

The unit tests written during development have been a real boon: every time an error was discovered I would just add the offending expression to the tests and work through the code until the problem was resolved. Re-running all the tests then verified that nothing else was broken in the process. This would have been a much more laborious and error-prone process had the test scaffolding not been in place. Examples such as this just reaffirm my conviction that unit tests are indispensable for maintaining code quality and correctness.

Note: As is always the case with experimental software such as this I give no guarantee that it will always perform correctly (so please don’t use it for anything critical).

If you discover that the calculator returns something unexpected or plain wrong then don’t hesitate to leave me a comment to tell me about it.

If you are interested in trying it out you can download it here: Calcium 1.0.1.0

ChainPop: Fun with games and Windows Azure

As is the tradition for me I got some for-fun coding done during the Christmas holiday. While playing Minecraft with my 7-year-old son we discussed how it is actually possible for a single person to make a game (which got him excited). I therefore decided to follow this up with some game making of my own (the offspring helped out with play testing when I could drag them away from Minecraft). I took some old collision detection code I had experimented with before and proceeded to turn it into a game. Getting it into playable condition was quite fast; polishing it up and play testing it took a while longer, which is why I'm presenting it here in February.

ChainPop

ChainPop

As I currently work on projects involving Windows Azure I decided to add an online scoreboard which you will find here: http://chainpop.azurewebsites.net. You can also download the game directly here. The scoreboard is an extremely simple Web Application / REST(ish) API built with NancyFx. Scores are stored in and retrieved from an Azure Table Storage table, and the Nancy module serving as an entry point is all of two dozen lines long (see below).

public class IndexModule : NancyModule
{
  public IndexModule()
  {
    var configuration = ServiceConfiguration.FromConfigurationManager();
    var scoreRepository = new ScoreRepository(configuration);
 
    Get["/"] = pm =>
    {
      var application = configuration.Application;
      var version = configuration.Version;
      var mode = configuration.Mode;
      var topList = scoreRepository.GetTopList(
        application, version, mode).Take(100).ToList();
      return View["index", topList];
    };

    Post["/api/score/"] = _ =>
    {
      var score = this.Bind<GameScore>();
      var gameScoreValidator = new GameScoreValidator();
      var validationResult = gameScoreValidator.Validate(score);
      if (!validationResult.IsValid) 
        return Response.AsJson(
          validationResult, HttpStatusCode.BadRequest);
      var result = scoreRepository.Add(score);
      return result;
    };
  }
}

Note that ChainPop is a WPF application and requires .NET Framework 4 to run. Try it out and tell me what you think. More is sure to follow.

Babysteps with Node.js and Mocha

Not being a JavaScript expert, I still took a special interest in the Node.js sessions during my time at Visual Studio Live Chicago 2013. Running JavaScript server side seemed very intriguing, and the examples given made it clear that setting up a simple web server is a piece of cake. Node.js utilizes the V8 JavaScript engine to run the scripts you create. Having previously experimented with the HTML5 canvas control and JavaScript, it seems obvious that this engine is far superior to those of the competition.

The best session on this topic at VSLive Chicago was by far the one titled JavaScript, Meet Cloud: Node.js on Windows Azure by Sasha Goldshtein. An excellent speaker covering such topics as deploying Node.js applications to Windows Azure, writing Node.js MVC(ish) applications with Express as well as using the Jade template engine. Sasha seemed to be prepared for any eventuality and seamlessly switched to a pre-recorded video when the Windows Azure management interface once acted up while he was demonstrating a deployment.

My own experimentation resulted in a trivial server application to compute prime numbers and return them as json data. Unit tests were created with the help of the Mocha unit test framework. The server code itself:

var http = require('http');
var url = require('url');
var primecalc = require('primecalc');

var port = 8000;

var server = http.createServer(function (request, response) {
  var number = url.parse(request.url).path.substring(1);
  response.writeHead(200, { "Content-Type": "application/json" });
  var primes = primecalc.getPrimes(number);
  response.end(JSON.stringify(primes));
}); 

server.listen(port);

module.exports = server;

console.log("Server running at http://localhost:" + port + "/");

Integration tests for the server:

var app = require('../primes');
var http = require('http'); 
var assert = require('assert');
describe("Tests for the primes server application\n", function () {
  describe('/', function () {
    it('should return 200', function (done) {
      http.get('http://localhost:8000', function (res) {
        assert.equal(200, res.statusCode);
        done();
      });
    });
  });
  describe('/31', function () {
    it('should return primes between 0 and 31', function (done) {
      http.get('http://localhost:8000/31', function (res) {
        var data = '';
        res.on('data', function (chunk) {
          data += chunk;
        });
        res.on('end', function () {
          assert.equal('[3,5,7,11,13,17,19,23,29,31]', data);
          done();
        });
      });
    });
  });
});

The prime calculating code itself looks like so (and leaves much to be desired):

function primecalc()
{
	this.isPrime = function(number)
	{
		if (number < 2) return false;
		if (number == 2) return true;
		if (number % 2 == 0) return false;
		for (var i = 2; i < number; i++)
		{
			if (number % i == 0) return false; 
		}
		return true;
	}
	this.getPrimes = function(upto)
	{
		var primes = new Array();
		for (var i = 3; i <= upto; i++)
		{
			if (this.isPrime(i)) primes.push(i);
		}
		return primes;
	}
}
module.exports = new primecalc();

Unit tests for the prime number generator:

var app = require('../lib/primecalc'); 
var assert = require('assert');
describe("Tests for the primecalc class\n", function () {
	describe("\tTesting the getPrimes function\n", function () {
		describe("\t\tWhen it is called with upto = 31\n", function () {
			it("\t\t\tshould return all primes between 0 and 31\n", function () {
				var primes = app.getPrimes(31);
				assert.deepEqual(primes, [3, 5, 7, 11, 13, 17, 19, 23, 29, 31]);
			});
		});
	});
	describe("\tTesting the getPrimes function\n", function () {
		describe("\t\tWhen it is called with upto = 0\n", function () {
			it("\t\t\tshould return an empty array\n", function () {
				var primes = app.getPrimes(0);
				assert.deepEqual(primes, []);
			});
		});
	});
	describe("\tTesting the isPrime function\n", function () {
		describe("\t\tWhen it is called with prime value\n", function () {
			it("\t\t\tshould return true\n", function () {
				var primes = app.isPrime(13);
				assert.equal(primes, true);
			});
		});
	});
	describe("\tTesting the isPrime function\n", function () {
		describe("\t\tWhen it is called with a non prime value\n", function () {
			it("\t\t\tshould return false\n", function () {
				var primes = app.isPrime(64);
				assert.equal(primes, false);
			});
		});
	});
});

The main challenges in putting this together had to do with testing the JSON data returned by the server itself (that took a fair bit of Googling) as well as how to reference external code (primecalc) in a Node.js application. The only way I got Node to resolve it was by creating a minimal package (with an index.js file) under the node_modules directory (under the project directory). Trying to simply link in an external js file was a complete no-go. To run the unit tests I resorted to simple command files but realise that setting up a proper Makefile would make it much more convenient (next post perhaps). I guess I have become a bit spoiled by the way the .NET Framework and Visual Studio serve everything up for you. Every little snag with Node.js and Mocha was maddening, but getting the tests to finally run was very satisfying.

Pushing Lightswitch at VSLive Chicago 2013

VSLive Chicago 2013 Keynote

The day 1 keynote at Visual Studio Live in Chicago was titled “Visual Studio, .NET and the Cloud” and was delivered by Jay Schmelzer (Director of Program Management for the Visual Studio Team at Microsoft). Heavy emphasis was put on development with Visual Studio Lightswitch, Sharepoint and Office 365.

Visual Studio Lightswitch is an interesting solution for those plain vanilla business applications that really don’t benefit from a fully custom approach. Why reinvent the wheel by constantly writing the same old validation code for things like phone and social security numbers?

At the same time the developer community seems to be almost universally skeptical as to whether it is actually a good idea. There are a number of reasons for this. Some developers are dismissive of the notion that we will ever be able to create complex business applications without writing code. Others object to the notion that non-technical people will be able to use the tool; it may not be development, but it is still quite complex.

Regardless of the possible objections I think this is an interesting option which might become popular for those business applications that don’t really need to follow a custom design or have much in the way of really complex business logic.

In short, even though the idea of developing in Visual Studio Lightswitch doesn’t exactly fill me with warm and fuzzy feelings, I do see a business case here. If you have a different opinion please don’t hesitate to comment.

Visual Studio Live 2013 in Chicago

Chicago

I am currently at the Visual Studio Live conference in Chicago attempting to absorb all the information I can on subjects as diverse as Windows 8 application development, Windows Azure, and Node.js.

The first day is all workshops and I’m attending the “Build a Windows 8 Application in a Day” workshop held by Rockford Lhotka. The term “workshop” is a bit misleading as it is more of an in-depth overview of developing for the Windows App Store whether through C# or JavaScript.

Given that this conference is not arranged by Microsoft, the discussion turns out to be quite candid and open. Windows 8 represents a significant change and is in many areas (such as touch and tablet computing support) clearly an improvement. It is obvious however that the right-brain, left-brain split between the “Metro” and the Desktop sides has caused significant confusion in the developer community.

Once I have collected my thoughts and finalized some snippets of code I have been working on I will post it here. Interestingly enough I found myself, during the day, drifting over to the JavaScript side of Windows 8 application development rather than the more natural (for me) C# route.

Pushing pixels with the HTML5 canvas.

Having seen an HTML5 and JavaScript canvas demo that my colleague Emil Åström had put together, I made a mental note to try my hand at it myself as soon as time allowed. What intrigued me the most was the frame rates he was able to achieve while performing quite a lot of calculations for every pixel. This is a truly impressive leap for JavaScript, even though the performance you could expect when writing this sort of code closer to the hardware (e.g. in assembler) would of course completely wipe the floor with it.

My experimental application is a cosine interpolation between points on a grid “overlaying” pixels of the canvas. The interpolated elevation is translated into a color by interpolating between three colors. It is worth noting that this code was written for readability and understanding rather than with performance in mind.

I have posted the source to GitHub and you may directly preview the page here.

Canvas Plasma

The goal I set myself was to get the demo up and running even in full screen with reasonable performance. To this end I decided to render the plasma to a smaller canvas off screen and then paste it to the larger canvas for each frame.

To do so requires a rather elaborate setup.

// Set up our rendering canvas, context and image data.
var sw = 240;
var sh = 240;
var renderCanvas = createCanvas(sw, sh);
var renderContext = renderCanvas.getContext("2d");
var renderImageData = renderContext.createImageData(sw, sh);

// Retrieving our target canvas, and context.
var targetCanvas = document.getElementById("plasmaCanvas");
resizeTargetCanvas(targetCanvas);
var targetContext = targetCanvas.getContext("2d");

// Scale between the render and target canvas sizes.
targetContext.scale(targetCanvas.width / renderCanvas.width,
                    targetCanvas.height / renderCanvas.height);

Once this is done it is basically just a matter of pasting each frame onto the target canvas in the render loop.

// Write image data to render canvas.
renderContext.putImageData(renderImageData, 0, 0);
// Draw render canvas onto the target canvas.
targetContext.drawImage(renderCanvas,0, 0);

To keep it all straight in my head I decided that a few helper “classes” were necessary. The most central of them is the one dealing with the grid data and its animation. The others are a simple color class for interpolation and a grid point holding a value and an animation direction (up or down).

function grid(w, h, from, to, step)
{
    this.w = w;
    this.h = h;
    this.from = from;
    this.to = to;
    this.step = step;
    this.data = new Array(w);
    for (var i = 0; i < this.data.length; i++) {
        this.data[i] = new Array(h);
        for (var j = 0; j < this.data[i].length; j++)
        {
            this.data[i][j] = 
                new gridPoint(Math.random(), 
                    Math.random() < 0.5);
        }
    }
    this.animate = function()
    {
        for (var i = 0; i < this.w; i++) {
            for (var j = 0; j < this.h; j++)
            {
                var v = this.data[i][j].v;
                var d = this.data[i][j].d;
                var rs = Math.random() * this.step;
                v = d == true ? v + rs : v - rs;
                if (v > this.to)
                {
                    v = this.to;
                    d = false;
                }
                if (v < this.from)
                {
                    v = this.from;
                    d = true;
                }
                this.data[i][j].v = v;
                this.data[i][j].d = d;
            }
        } 
    }
}

function color(r, g, b, a) {
    this.r = r;
    this.g = g;
    this.b = b;
    this.a = a;
}

function gridPoint(v, d) {
    this.v = v;
    this.d = d;
}

The interpolation over the grid onto the canvas pixel data is encapsulated in the following functions. A nice writeup on cosine interpolation may be found here.

function gridInterpolate(x, y, grid)
{
    // How much space between grid 
    // points in unit coordinates (0-1).
    var gsx =  1 / (grid.w - 1);            
    var gsy =  1 / (grid.h - 1); 
    // Top/left grid point for grid 
    // quadrant that this point belongs to.
    var ox = Math.floor(x / gsx);
    var oy = Math.floor(y / gsy);
    // Elevations for all grid 
    // points in quadrant.
    var ga = grid.data[ox][oy].v;
    var gb = grid.data[ox + 1][oy].v;
    var gc = grid.data[ox][oy + 1].v;
    var gd = grid.data[ox + 1][oy + 1].v;
    // Unit coordinate within grid.
    var gx = (x - (ox * gsx)) / gsx;
    var gy = (y - (oy * gsy)) / gsy;
    // Interpolate for elevations along 
    // top, left, bottom, and right.
    var zt = cosineInterpolate(ga, gb, gx);
    var zl = cosineInterpolate(ga, gc, gy);
    var zb = cosineInterpolate(gc, gd, gx);
    var zr = cosineInterpolate(gb, gd, gy);
    // Interpolate exact elevation for 
    // the point.
    return (cosineInterpolate(zl, zr, gx) + 
    cosineInterpolate(zt, zb, gy)) / 2;  
}

function cosineInterpolate(p1, p2, v)
{
    var ft = v * Math.PI;
    var f = (1 - Math.cos(ft))*0.5;
    return p1 * (1 - f) + p2 * f;
}
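To make the behavior of the interpolation concrete, here is a small standalone check of cosineInterpolate (reproduced verbatim from above): the endpoints are preserved, the midpoint is the exact average, and values near the ends ease in and out rather than changing linearly.

```javascript
// Copied from the article, so this snippet runs on its own.
function cosineInterpolate(p1, p2, v)
{
    var ft = v * Math.PI;
    var f = (1 - Math.cos(ft)) * 0.5;
    return p1 * (1 - f) + p2 * f;
}

console.log(cosineInterpolate(10, 20, 0));    // → 10 (left endpoint)
console.log(cosineInterpolate(10, 20, 1));    // → 20 (right endpoint)
console.log(cosineInterpolate(10, 20, 0.5));  // → ~15 (midpoint average)
console.log(cosineInterpolate(10, 20, 0.25)); // → ~11.46 (eases in, below the linear 12.5)
```

This easing is what makes the animated plasma look smooth: linear interpolation would produce visible creases along the grid lines where the slope changes abruptly.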

A notable observation from this experience is that Google Chrome basically wipes the floor with IE9/10 in terms of frame rate. Using Google Chrome on my Apple MacBook Pro (Late 2011) I get 90+ frames per second in full screen, while IE barely reaches 15. Performance also varies considerably with hardware: running it on my work machine (Lenovo T410S) results in more flickering and lower frame rates.

In summary this was an extremely fun little side project and I am sure to continue experimenting with the HTML5 canvas in the future.

Autonomous Sphero with orbBasic

This is a quick update to my earlier post, adding orbBasic code for the Orbotix Sphero that gives it some rudimentary autonomy. Note that the code comments below should not be sent to the Sphero. I am currently working on a more user-friendly tool for loading orbBasic programs onto the Sphero. The undocumented command line tool included in the embryonic framework from my previous post will read an orbBasic file, strip the comments, and transfer it to the Sphero. However, since I only posted source code you will need Visual Studio to compile and run it.

' We set color to green and wait for  
' a small bump to start the program 
' proper.
10 basflg 1
20 RGB 0, 255, 0
30 if accelone < 4000 then goto 30 
' Initializing
40 H = rnd 359
50 D = 1
60 S = rnd 192 + 63
' Set random color
70 RGB rnd 255, rnd 255, rnd 255
' Roll in random direction until 15 
' seconds have passed (to avoid 
' getting stuck after soft collision) 
' or until we hit something.
80 goroll H, S, D
90 timerA = 15000
100 timerB = rnd 1500 + 1500
110 if accelone > 5000 or timerA = 0 then goto 150
120 if timerB > 0 then goto 110
' Every few seconds we randomly 
' adjust our heading somewhat 
' (+/- 15 degrees) and continue.
130 H = (H + rnd 30 - 15) % 360
140 goto 60
' We hit something and perform 
' a hard reverse.
150 H = ((H + 180) % 360)
160 goroll H, 255, 2
170 delay 1000
' Let's take it from the top.
180 goto 60

Balls out fun with the Sphero and .NET

Having received a Sphero robotic ball (made by Orbotix) as a Christmas present from my wife (yes, she is truly amazing) I went into a frenzy of coding over the holidays. Robotics has always interested me but I never got around to experimenting with it before now.

For those of you looking to tinker a bit with robotics, I cannot recommend this little gadget highly enough. The Sphero is a hermetically sealed spherical shell of polycarbonate crammed full of robotic goodness. Built-in sensors include an accelerometer, a compass, and a gyroscope. All communication with the Sphero is performed via Bluetooth through the sealed shell, which makes sense as it is intended to be used outdoors and even in water (it floats). The device is thoroughly documented, which allows for some very satisfying hacking activity, something that seems to be encouraged by Orbotix, the company behind this little marvel.


Driving the Sphero around the living room using a smartphone as a controller is fun and extremely entertaining for adults, offspring, and felines alike. Still, for a device promoted as a robot, it started to feel more and more like a radio-controlled ball. A true robot is supposed to scurry around the place and do stuff without me having to frantically wave my phone around. After searching the developer forums for a while I found that there is an implementation of BASIC (orbBasic) that runs on the Sphero and allows it to exhibit more autonomy. My first goal became to find a way to upload an orbBasic program to the Sphero from a .NET application and run it.

Sadly the developer SDKs are only available for mobile platforms (namely Android and iOS) for now. Luckily the low-level documentation for the Sphero is excellent and allowed me to ping the device via Bluetooth from a .NET application and receive a reply within an hour or so of tinkering.

After some additional work and poring over the documentation I now have a fairly decent (if hurried) experimental framework, SpheroNET v0.1, up and running. Please note that it IS experimental and is likely to throw exceptions the moment you look at it funny. The solution includes a console application for testing but nothing else.

What was needed:

  1. A Sphero robotic ball.
  2. A way of sending and receiving data over Bluetooth in .NET. For this I used the excellent (and free) 32feet.NET.
  3. The Sphero API documentation.
  4. The orbBasic documentation.

First we need to find some devices to connect to.

BluetoothClient client = new BluetoothClient();
List<BluetoothDeviceInfo> devices = new List<BluetoothDeviceInfo>();
devices.AddRange(client.DiscoverDevices());

Once we have retrieved a list of available devices and selected the Sphero from among them, we need to connect to it. Note that we allow for a number of retries; this is because the connection process would otherwise, for unknown reasons, frequently fail. A successful connection returns a NetworkStream that we can write to and read from.

private NetworkStream Connect(BluetoothDeviceInfo device, int retries)
{
  BluetoothAddress addr = device.DeviceAddress;
  Guid serviceClass = BluetoothService.SerialPort;
  var ep = new BluetoothEndPoint(addr, serviceClass);
  for (int i = 0; i < retries; i++)
  {
    try
    {
      _client.Connect(ep);
      break;
    }
    catch (Exception ex)
    {
      Thread.Sleep(300);
      if (i == (retries - 1))
        throw new Exception(
        string.Format("Could not connect after {0} retries.", retries), ex);
    }
  }
  NetworkStream stream = _client.GetStream();
  return stream;
}

Data sent to the Sphero needs to be formatted into binary sequences, i.e. packets that it can understand. There are three types of packets: command, response, and asynchronous packets. The format for all three is quite similar, which is why I chose to base all packets on a single abstract base class. The primary function of this class is to calculate and update the packet checksum, as well as to allow access to the two fields (SOP1 and SOP2) common to all three packet types.

public abstract class SpheroPacket
{
  protected byte[] _data = null;
  public bool IsValid
  {
    get
    {
      return CalculatedChecksum.HasValue &&
        Checksum == CalculatedChecksum.Value;
    }
  }

  public byte[] Data
  {
    get
    {
      return _data;
    }
  }

  public byte SOP1
  { get { return _data[0]; } }

  public byte SOP2
  { get { return _data[1]; } }

  public byte Checksum
  {
    get { return _data[_data.Length - 1]; }
    set { _data[_data.Length - 1] = value; }
  }

  public byte? CalculatedChecksum
  { get { return GetChecksum(); } }

  public SpheroPacket()
  {
    _data = new Byte[] { 0xFF, 0xFF };
  }

  public void UpdateChecksum()
  {
    byte? checksum = GetChecksum();
    if (checksum.HasValue)
    {
      _data[_data.Length - 1] = checksum.Value;
    }
  }

  public byte? GetChecksum()
  {
    if (_data == null || _data.Length < 4) return null;
    uint sum = 0;
    for (int i = 2; i < _data.Length - 1; i++)
    {
      sum += _data[i];
    }
    return ((Byte)~(sum % 256));
  }

  public override string ToString()
  {
    const string invalid = "[invalid checksum!]->";
    byte[] data = Data;
    StringBuilder sb = new StringBuilder(data.Length * 3);
    if (!IsValid) sb.Append(invalid);
    foreach (var b in data)
    {
      sb.Append(string.Format("{0:X02}", b));
    }
    return sb.ToString();
  }
}
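As a quick illustration of the checksum rule implemented in GetChecksum above (sum every byte after the two SOP bytes, excluding the checksum slot, modulo 256, then bitwise-invert), here is a small JavaScript sketch. The packet layout follows the API documentation's ping command; the sequence number 0x52 is just an arbitrary example value.

```javascript
// Checksum rule: sum bytes from index 2 up to (but not including)
// the final checksum slot, mod 256, then bitwise complement.
function checksum(packet) {
  var sum = 0;
  for (var i = 2; i < packet.length - 1; i++) {
    sum += packet[i];
  }
  return (~(sum % 256)) & 0xFF;
}

// Ping command: SOP1, SOP2, DID 0x00, CID 0x01, SEQ 0x52,
// DLEN 0x01, then a placeholder for the checksum.
var ping = [0xFF, 0xFF, 0x00, 0x01, 0x52, 0x01, 0x00];
ping[ping.length - 1] = checksum(ping);
console.log(ping[ping.length - 1].toString(16)); // "ab"
```

Working through it by hand: 0x00 + 0x01 + 0x52 + 0x01 = 0x54, and ~0x54 masked to a byte is 0xAB, so the full packet on the wire is FF FF 00 01 52 01 AB.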

Much of what you might want to do with the Sphero can be accomplished by simply sending properly formatted command packets. The SpheroCommandPacket class below should be able to produce any command described in the API documentation if configured correctly through its constructor.

public class SpheroCommandPacket : SpheroPacket
{
  public byte DeviceId
  { 
    get 
    { 
      return _data[2]; 
    } 
    set
    {
      _data[2] = value; 
      UpdateChecksum(); 
    } 
  }

  public byte CommandId
  { 
    get 
    { 
      return _data[3]; 
    } 
    set
    {
      _data[3] = value;
      UpdateChecksum();
    }
  }

  public byte SequenceNumber
  { 
    get
    {
      return _data[4];
    }
    set
    {
      _data[4] = value;
      UpdateChecksum();
    }
  }

  public byte DataLength
  { 
    get
    {
      return _data[5];
    }
    set
    { 
      _data[5] = value;
      UpdateChecksum();
    }
  }

  public SpheroCommandPacket(
    byte deviceId, byte commandId, 
    byte sequenceNumber, byte[] data): base()
  {
    List<byte> list = new List<byte>();
    list.AddRange(_data);
    list.AddRange(new Byte[] { deviceId, commandId, sequenceNumber });
    if (data != null)
    {
      list.Add((byte)(data.Length + 1));
      list.AddRange(data);
    }
    else
    {
      list.Add(0x01);
    }
    list.Add(0xFF); // Placeholder for checksum
    _data = list.ToArray();
    UpdateChecksum();
  }
}

Obtaining properly formatted command packets is a simple matter of implementing a factory class for that purpose.

public static SpheroCommandPacket EraseOrbBasicStorage(StorageArea area)
{
  return new SpheroCommandPacket(0x02, 0x60, 0x01, new byte[] { (byte)area });
}

public static SpheroCommandPacket AppendOrbBasicFragment(StorageArea area, string fragment)
{
  List<byte> data = new List<byte>();
  byte[] fragBytes = Encoding.Default.GetBytes(fragment);
  data.Add((byte)area);
  data.AddRange(fragBytes);
  return new SpheroCommandPacket(0x02, 0x61, 0x01, data.ToArray());
}

public static SpheroCommandPacket ExecuteOrbBasicProgram(StorageArea area, UInt16 fromLine)
{
  byte[] data = new byte[3];
  data[0] = (byte)area;
  data[1] = (byte)((fromLine & 0xFF00) >> 8);
  data[2] = (byte)(fromLine & 0x00FF);
  return new SpheroCommandPacket(0x02, 0x62, 0x01, data);
}

public static SpheroCommandPacket AbortOrbBasicProgram()
{
  return new SpheroCommandPacket(0x02, 0x63, 0x01, null);
}
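One detail worth noting in ExecuteOrbBasicProgram is that the 16-bit line number is packed most significant byte first (big-endian), which is the byte order the Sphero expects for multi-byte values. A minimal sketch of the same arithmetic:

```javascript
// Split a 16-bit value into [high byte, low byte], mirroring the
// mask-and-shift in ExecuteOrbBasicProgram above.
function splitUInt16(n) {
  return [(n & 0xFF00) >> 8, n & 0x00FF];
}

console.log(splitUInt16(10));  // [0, 10]
console.log(splitUInt16(300)); // [1, 44]  (300 = 0x012C)
```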

Below we send the assembled packet over the NetworkStream. Notice how we also assign a sequence number (one byte) to the packet; the sequence number is echoed in any resulting response packets, which lets you figure out which response belongs to which command. How to relate asynchronous packets from the Sphero to the commands that triggered them is another question for which I have no definitive answer at the moment. Asynchronous packets may be sent as a result of commands or as a result of some internal trigger in the Sphero (for example just before it goes to sleep).

public void SendPacket(SpheroCommandPacket packet)
{
  packet.SequenceNumber = GetNextSequenceNumber();
  Stream.Write(packet.Data, 0, packet.Data.Length);
}

Sending an orbBasic program to the Sphero is now quite straightforward. The program below, once sent and executed, will make the Sphero glow a steady green as long as it is not disturbed. Once a preset acceleration threshold is reached it will pulse red for about 10 seconds. This is, admittedly, not a very exciting program but it is a good start.

10 RGB 0, 255, 0
20 if accelone > 1700 then goto 40
30 goto 20
40 for J = 1 to 8
50 for I = 0 to 255 step 10
60 RGB I, 0, 0
70 delay 50
80 next I
90 next J
100 goto 10

For a fuller picture of how to receive and classify incoming packets (response or async), please refer to the full SpheroNET source code.

For the future I have a number of improvements planned. Firstly, I will spend some more time writing orbBasic code to make the Sphero behave like the robot it is, much to the delight, I would guess, of children and cats alike. Secondly, all the commands listed in the API documentation should be implemented. An intuitive way of relating sent command packets with their resulting response packets (and asynchronous packets where possible) should also be found.

As always any feedback is welcome.

Amazing video of simulated galaxy formation

Wired UK recently ran a very interesting article about moving-mesh cosmology. In a recent simulation of galaxy formation, billions of points of mass slowly accrete to form stars and galaxies, continuing their cosmic dance for some 14 billion simulated years. The software used to create this marvel is called Arepo (created by Dr. Volker Springel) and utilizes Voronoi tessellation.

One end result is the truly stunning animation below (covering a “mere” 9 billion years).